A Trade Unionist Looks at Job Evaluation *


William Gomberg 
Management Engineering Department, International Ladies’ Garment 
Workers’ Union, New York, N. Y. 


A trade unionist’s attitude toward job evalu- 
ation is largely governed by his estimate of its 
effectiveness as a collective bargaining tool. 
Collective bargaining is the embodiment of 
democratic practice by which workers exercise 
a voice in their working conditions. Job evalu- 
ation is a subordinate gimmick. The interest 
of the trade union in the relative differentials 
received by workers on different jobs is appar- 
ent to anyone possessing some insight. No 
union will be content to negotiate blanket base 
wages and then leave the distribution of the 
relative increments exclusively to management. 
Discussions about the relative effectiveness of
job evaluation programs with or without union
participation make about as much sense to
the average trade unionist as a tract upon the
relative effectiveness of marriages with or
without grooms.

Trade unions are suspicious of proposals to 
correct large inequities particularly in first ne- 
gotiations. Somehow, many of these inequi- 
ties, it has been discovered, arise from the 
strategic increase. The foreman reports where 
the union is making organization progress dur- 
ing the campaign. “Inequities” are discovered 
in this department and increases ordered. In 
this way relative wages reflect the attempt to 
defeat the union. The union when asked to
agree to correcting inequities in such a situation
is likely to reply: “You were satisfied with
non-union confusion for a spell. Show us your good
faith by putting up with trade union chaos for
a while. We'll discuss it in future negotiations.”
Later on, of course, these inequities
raise internal problems for the union as
relationships between the union and management
become stabilized. It is then that the union
may be ready to discuss job evaluation.

* The editor invited William Gomberg, Ph.D. in
Industrial Engineering, to prepare this paper for the
Journal of Applied Psychology because of the interest
aroused by his commentary at the joint session of the
Division of Industrial Psychology and the Industrial
Relations Association of America at the September 1949
meeting of the APA in Denver, Colorado, and the
critique of job evaluation presented by him at the
Conference on Job Evaluation held in Minneapolis in
December 1949 under the sponsorship of the Industrial
Relations Center of the University of Minnesota.—Editor.

Psychologists have now entered the job 
evaluation field. It is no longer the ex- 
clusive province of the industrial engineer. 
Before a trade unionist can examine or evalu- 
ate intelligently their contribution to the field, 
it might be useful to delve into the very founda- 
tions upon which the technique has been built. 
It will, therefore, be my intention to examine 
in some detail two principal questions: (1) How 
does the unionist’s concept of the function of 
job evaluation compare with the set of ideas 
held by management; and (2) How closely do 
the measuring techniques developed by both 
psychologists and engineers actually perform 
the function claimed on their behalf by practi- 
tioners in the field? 

A job evaluation program will generally be 
proposed to a union as a technique for rational- 
izing the wage structure. It is based upon the 
western philosophy that men should be com- 
pensated according to the worth of the job they 
perform. Different jobs call for men with dif- 
ferent capacities. Management feels that dif- 
ferentials in income should reflect the demands 
that these different jobs impose upon different 
men’s capacities. This is at least what they 
say. The trade unionist is not so sure that 
they mean it. 

We can all agree that the wage scale should 
begin at the bottom of the ladder with the job 
that makes the fewest demands upon anybody’s
capacity. Let’s pick such a job at random, say 
that of a janitor. Now any man who takes a 
janitor’s job brings to it the very minimum ca- 
pacities specified in most evaluation schemes. 
But let us compare the capacities required by 
the janitor’s job with those required by the 
president of the giant corporation, the modern 
industrial genius. 

Morris Viteles has some interesting things to 
say about the relative capacities of the two 
groups from which respective candidates for 
these positions are drawn. He wrote “The 
difference between the general intelligence re- 
quired of the janitor and that demanded of a 
highly skilled worker or top superior appears to 
be well nigh limitless. Actually in terms of 
numerical values, the general intelligence of the 
successful employee in such a top job is seldom 
found to be more than three times that of the 
most stupid worker in the least responsible job. 
This ratio of 1-3 between the extremes of 
ability and an even lower ratio of less than 1-2 
for physical measurements, measures of motor 
function, etc., recur with striking frequency in 
studies of individual differences in ability, skill 
and other human traits. Very seldom is the 
ratio greater than 5-1” (22). 

The implications of these conclusions for 
job evaluation are quite obvious. No job can 
be worth more than the maximum capacities a 
man is expected to bring to a job. On the 
other hand no job can be worth less than the 
very marginal capacities a minimum human 
being must of necessity bring to a job. If the
president of a great corporation seriously be- 
lieves that he wants the pay scale based upon 
the relative objective value of a job, then he is 
obliged to take for himself no more than five 
times the hourly rate which he pays to the 
lowliest employee in his establishment. 

The trade unionist has discovered that the 
corporation executive will not be satisfied with 
such an arrangement. It is fruitless to expect 
complete rationalism in a wage policy when 
almost all other economic policies possess irra- 
tional elements. Pricing policies are not al- 
ways based upon cost. De luxe models of 
appliances are marked up more because of the 
average consumer’s willingness to pay more 
rather than in proportion to the extra labor 
cost and materials in de luxe models. It is for
reasons like this that trade unionists are
exceedingly suspicious of this desire to rationalize
only one element in the economic picture, the
wage structure. Suppose that $300.00 were
to be divided between two men rationally. 
One man possessed a relatively low intelli- 
gence, the other man a relatively high intelli- 
gence. The former was phlegmatic and in- 
different, the latter aggressive and ambitious. 
As Viteles remarks, the ratio of this combina- 
tion of traits could be at the most 1-3. Would 
the distribution of this income be divided in the 
ratio of $75.00 to $225.00 or would the man 
with the high pugnacity coefficient and the 
intelligence to match it go off with the whole 
$300.00? His intelligence might lead him to 
leave the victim about $10.00 to make sure that 
his rival would not be goaded into revolt. Now 
just multiply the man with the low pugnacity 
coefficient by millions and we have a much bet- 
ter explanation of why incomes are distributed 
the way that they are than any high flown 
theories about relative contributions to society. 

Attempts to rationalize this distribution can 
be made. For one thing we are told that exec- 
utive jobs can’t be measured on the same scales 
used for factory jobs. This argument of course 
is confined to advocates of a point system. 
Advocates of the factor comparison method 
assure us that they can use their system for 
jobs right up to $30,000.00 per year (5). Non- 
linear wage curves resembling the exponential 
function are presented to justify the steep in- 
creases in wages that are presented to the pub- 
lic. However, although recognizable ability 
increments are recognized in geometrical rela- 
tionships, the overall ratio between top and 
bottom still can not or should not exceed the 
ratio of 1-3 or 1-5. Compare the salary of C. E.
Wilson of General Motors with that of a porter
and what do you get? The point plan advocates
who inform us that plans like the NEMA
are not adapted to the measurement of execu- 
tive positions, nevertheless maintain that their 
rigid common measuring scale can measure the 
relative requirements of sewer workers and tool 
makers. More about these measuring tech- 
niques later. The only rationale that can be 
advanced to justify such a curve is that it re- 
flects the small “genius supply” in the popula- 
tion. Supply and demand, however, does not 
enter into job evaluation plans. 
You may now assume that I am opposed to
job evaluation. But that is not the case at all. 
I do not believe, and I do not think that 
management believes, that it is the function of 
job evaluation to compensate workers in ac- 
cordance with the so-called value of the job. 
I do not believe that job evaluation can be 
used as the sole determinant of how to build a 
relative wage structure. It is but one of the 
many factors that enter the collective bargaining
picture in fixing a final wage scale. It is a
device to measure relative job content and 
nothing more. This relative content is just 
one of many factors that contribute to the 
building of the final wage scale. 

Some of the other factors that I have in mind 
are as follows: (a) irregularity of employment; 
(b) the career prospects of the job; (c) supply 
and demand; and (d) the traditional prestige 
carried by the job in the plant social system. 

For example, two jobs “A” and “B” may 
carry the same point assignment. Job “A”
exists only during the tooling up period. “B” 
continues right through the production season. 
A union may ask for more money for job “A” 
for this reason. Then again, what is it that
leads a young chemical engineer with all his
training to work for lower wages than a plumber?
Some of it, of course, is a rather silly concept of
a contradiction between unionization and
professional status. The other part is the young
engineer’s willingness to pay back a part of his
legitimate return in the hope of serving his
apprenticeship for what he hopes will be the
post of works manager.

Supply and demand is a factor that is often 
overlooked. Naturally unions exist to protect 
workers against being victimized by being com- 
pelled to compete with one another like so many 
bushels of wheat; but let’s take a look at another 
part of the picture. During the war the War
Labor Board had fixed the wages of foundry
hands at seventy-five cents per hour. This was
done on the basis of a job evaluation plan
approved by the board. There was no strike.
The supply of foundry hands disappeared as
the old foundry hands sought employment
elsewhere. Of even further interest was the fact
that increased wages failed to increase the
supply of foundry hands. Foundry hands
traditionally had been at the bottom of the factory
social ladder and nobody wanted to stay there
even at an increased price.

A further illustration of how the factory’s
social system interferes with job evaluation
conclusions is seen in what revaluation
sometimes does to the promotional
sequence in a factory. A job at the very top of
the promotional ladder is devalued and with it 
all the aspirations of a group of men who had 
hoped to occupy that job. After all, the fun- 
damental purpose of job evaluation is to estab- 
lish a mutually acceptable criterion of equity. 
If both worker and supervisor agree that, for 
example, the cementers are the aristocrats of 
the raincoat industry, what useful purpose is 
served by upsetting this scale of values in favor 
of some mechanistic criterion of equity? 
These traditions are every bit as important as 
job content. What this all adds up to is that 
in erecting a structure of relative wages there 
will be any number of rates which must be con- 
sidered. There are: (1) The job evaluation 
rate based on relative job content; (2) the com- 
parative rate for the same job in other indus- 
tries in the same area; (3) the comparative rate 
for the same job in the same industry in the 
same area; (4) the comparative rate for the 
same job in the same industry in other areas; 
and (5) the comparative rate for the same job 
in other industries in other areas. 

For example, suppose the rate for a machinist
in the “X” automobile factory in Squeedunk
is $1.65 whereas the rate for a machinist in the
“Y” textile machinery works in Squeedunk is
only $1.25. Again the machinists’ rate in the
rival “Z” automobile factory in Squeedunk is
$1.75, whereas the machinists’ rate for the
“N” automobile works in Podunk is $1.95.
Still again, the machinists’ rate in the “M”
textile works in Podunk is $1.45. Thus, for
the single job of machinist we have five
separate rates, any one of which can be justified
upon some concept of equity. The job
evaluation rate is only one of these five. Collective
bargaining is the method used to satisfy both 
parties that there has been equitable consider- 
ation of rival claims. 

Now that we have established that job evalu- 
ation is just one subordinate tool of the collect- 
ive bargaining process, let us proceed to an 
examination and evaluation of the different 
methods proposed for measuring relative job
content. 

It would be fruitless to attempt to examine 
all job evaluation techniques in the space at my 
disposal. I should like to confine my remarks 
to comments on the point and factor compari- 
son techniques. It has been interesting to 
note the contributions of psychologists to this 
field. Lawshe and his students, Bellows, 
Chesler, Edwards, Hay, Otis, Rogers, and 
Turner have examined existing systems and 
proposed others. In the course of commenting 
upon these different plans it will be the writer’s 
intention to provide a trade unionist’s reaction 
to their significance. 

The most widespread plans in use today are 
the point plans such as the National Electrical 
Manufacturers Association (NEMA) plan. 
This plan lists eleven factors and weights each 
factor on the theory that this weighting deter- 
mines the contribution of that particular factor 
to the final result. This weighting is supposed
to determine the effect of the corresponding
factor in the final relative distribution of these
jobs. Tiffin (20, pp. 340-346) disclosed the
fundamental fallacy in all additive point plans
in his comments on merit rating. He pointed
out that it was not the maximum ranges of
possible points that determined the relative
weights of any one factor in the final re- 
sult but the variability of the factor. 

For example, suppose that I want to weight 
skill and working conditions equally. I allow
a possible range of fifty points for each factor. 
However, suppose that I rate five jobs as fol- 
lows: 

1 2 5 
Skill 5 45 
Working conditions 20 30 


I have actually used a range of 40 points for
skill and a range of 10 points for working
conditions. The 40 points actually used for skill
will have approximately four times the effect of
the 10 points contributed by working conditions.
Yet workers on job evaluation committees as well
as technicians have been under the impression
that each factor was playing a role equal to its
pre-assigned weight in the final determination of
the result.
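
The point can be checked numerically. The sketch below is only an illustration in Python and is not taken from Tiffin's analysis: the five ratings per factor are hypothetical, chosen so that skill spans 5 to 45 and working conditions span 20 to 30, and the standard deviation is assumed as one reasonable measure of a factor's variability.

```python
from statistics import pstdev

# Hypothetical ratings for five jobs (illustrative values only).
# Both factors were allowed the same nominal range of fifty points.
skill = [5, 15, 25, 35, 45]                # range actually used: 40 points
working_conditions = [20, 22, 25, 28, 30]  # range actually used: 10 points


def used_range(scores):
    """Spread between the highest and lowest rating actually assigned."""
    return max(scores) - min(scores)


def spread_ratio(a, b):
    """Ratio of the standard deviations of two factors."""
    return pstdev(a) / pstdev(b)


# Total points per job, as an additive point plan would compute them.
totals = [s + w for s, w in zip(skill, working_conditions)]

print("ranges used:", used_range(skill), used_range(working_conditions))
print("spread ratio, skill to working conditions:",
      round(spread_ratio(skill, working_conditions), 2))
print("total points per job:", totals)
```

With these assumed ratings the ordering of the totals follows skill almost entirely, although both factors were nominally given equal weight.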

Using the Thurstone factor analysis technique,
Lawshe and Satter (9, p. 197) demonstrated
the uselessness of the NEMA Plan for
its avowed purposes when they concluded that, 
“While there is considerable agreement from 
plant to plant insofar as the presence of factors 
is concerned, there is variation in the extent to 
which they contribute to ‘total point’ ratings 
and consequently, to the existing wage struct- 
ure. . . . It is clear that the extent to which 
each item or factor contributes to the total can 
not be determined by inspection of the scale 
alone and that the end result may yield results 
different from those intended by the makers of 
the scale.” 

When they point out that under the NEMA
plan, skill demands varied from 77.5 per cent in
one plant to 94 per cent in another, I submit 
that this is proof that the NEMA experts 
simply do not know what they are doing. It 
is for this reason that I am unable to under- 
stand the purposes of some of the other studies 
conducted by Lawshe. Job evaluation Study 
2 (10) is an attempt to prove that other factor 
job evaluation systems will give substantially 
the same results as the more complex technique. 
Lawshe concludes from this study (10, p. 184), 
“If the three item abbreviated scale were
employed in Plant A, 62 per cent of the jobs would
remain in the same labor grade, 37.2 per cent
would be displaced one labor grade and 0.8 per
cent would be displaced two labor grades. . . .
A simplified scale consisting of three or four
items would probably yield results that are
practically identical with those obtained by a
more complex system and would greatly reduce
the time consumed by the rating activity.”

I can not take issue with these conclusions 
but I cannot help feeling that they are trivial. 
Since Lawshe has proved that the operators of 
the NEMA system only know vaguely, if at all, 
what they are doing, how important is it to 
conclude that there is a more economical way 
to do the same thing? Yet Study 3 (11), 
Study 4 (12), Study 6 (14) and Study 8 (16) 
yield substantially the same conclusions as 
Study 2, and therefore, are of as little signifi- 
cance as Study 2. Study 8 concentrates on 
the reliability of raters using the abbreviated 
scales. The conclusions of Lawshe in Study 2 
have been independently confirmed by Rogers 
(18) and Chesler (3). However, Study 7 (15) 
does raise the question of a suitable criterion 
against which to measure some of the results 
obtained from these abbreviated scales. Lawshe,
Dudek, and Wilson conclude that (15, p.
128) “No conclusions about validity can be 
drawn from this study due to the lack of a 
suitable criterion. . . . Although short job 
evaluation systems consisting of only a few 
items may be statistically and logically justi- 
fied, it may be practically advantageous to in- 
clude additional items in the system which will 
make it more acceptable to raters and to em- 
ployees.” 

One of the principal complaints about point 
job evaluation systems has been their tendency 
to play down working conditions and physical 
hazards as important factors. Perhaps a more 
constructive approach would be to show both 
parties how to cease making old mistakes rather 
than claiming the old mistakes can now be 
made with less effort. But then, perhaps it 
helps to keep up the pretense lest the parties 
learn what they are doing. 

Study 5 (13) by Lawshe and Wilson applies 
the Thurstone centroid technique to the factor 
comparison method. They discovered that of 
the five factors considered in the evaluation only 
two were of any significance. Skill demands
accounted for 98 per cent of the final variation
in job rates. Physical requirements and working
conditions accounted for the remaining 2
per cent. However, since the factor comparison
method makes no pretense at weighting
different factors on any fixed scale, the
conclusion that two factors would have been
adequate in this case is of little significance.
How would you know what two factors to 
use? The factor comparison method is rooted 
in the existing wage structure of the firm. One 
of its main purposes is to derive the distribu- 
tions of weights. It is only when this has been 
done that one can know the particular domin- 
ant factors in the situation. In the paper mill 
it was skill. In the foundry, perhaps a domin- 
ant factor would be working conditions. We 
can not know until we’ve derived the scales. 

Trade unionists have been somewhat mystified
by the argument between Turner (21) and
Hay (7) over the comparative merits of each 
other’s per cent variation of the factor compari- 
son method. The principal virtue of the factor 
comparison method when weights were distri- 
buted in money terms was that it lent itself 
readily to collective bargaining purposes. Key
jobs could be defined as those jobs which both
parties agreed were being paid what they ought 
to be paid. Computations were in terms of 
money or ranks rather than abstract percent- 
ages. Although Hay called it a commonplace 
fallacy (8) in job evaluation, trade unionists 
nevertheless insisted that the key jobs fixed 
the evaluated rate curve. 

Briefly the vertical axis was labeled “what 
ought to be paid”; the horizontal axis, “what 
is being paid.” All the key jobs fell on the 45 
degree line by definition. Every one knew 
what he was doing. Attempts by consultants 
to impose a least squares line wage curve upon 
unions were resisted. The slope of these lines
always was less than 45 degrees. This least 
squares line was defined for what it was, the 
inequity line. The difference between this 
least squares line and the 45 degree line merely 
represented the area of past inequities. Hay 
warns that failure to make use of the least 
squares line would add substantially to the 
existing payroll. It would not achieve the 
desired purpose of redistributing the existing 
payroll. No union is interested merely in re- 
distributing a payroll. Future increases may 
be accepted on a differential basis to correct 
old inequities, but no union is interested in 
robbing Peter of what he already has to pay 
Paul in the future. 

Real collective bargaining took place over
the definition of key jobs because everyone 
readily understood their significance. Perhaps 
the key to the situation rests in the statement 
of Hay (7) “there are situations in which it is 
inadvisable to rely on salary or wage rates of 
the key jobs. This is most often for reasons 
of strategy or policy, such as when a joint 
union-management committee is doing the 
evaluation.” 

This is rather a strange confession that 
obfuscation is good management strategy in 
dealing with a union. The energy that is 
going into this dispute for obfuscation honors 
could be more constructively turned to other 
uses. 

The work of Chesler (2) has brought to- 
gether comparisons between the point job 
evaluation and the factor comparison method. 
He concludes (2, p. 474) that “intercorrelations
among six different company job evaluation 
systems ranged from 0.89 to 0.97 with a mean 
of 0.94. These six systems included two factor
comparison systems with five factors each, two 
point ratings with fifteen factors each, one 
point rating system with 12 factors and one 
ranking system. The results indicate a high 
degree of commonality among different job 
evaluation systems.” The standard jobs upon 
which Chesler discovered their commonality 
were all clerical, administrative and supervisory 
positions. It would be interesting to discover 
whether the same commonality would exist if 
jobs were selected from widely varied situations 
such as open hearth furnace tender, foundry
hand, and assembly worker. The importance
of one job depends on working conditions,
another on physical hazard, and the third on 
manipulative dexterity. Working conditions 
and physical hazards were completely absent 
from the office jobs measured by Chesler. 

Bellows and Estep (1) have attempted to 
utilize the United States Employment Service 
Occupational Characteristics check list for job 
evaluation purposes. They obtained a
correlation coefficient of 0.74 between the
results from a job evaluation plan and those
from the OCCL. The method described,
however, makes it difficult to understand what was
being done. The actual spread of points be- 
tween the lowest and highest job on the job 
evaluation scheme was 18 to 523 points. The 
actual spread on the OCCL was 10 to 62. In 
other words, the evaluation ratio of 1-19 is 
compared with the 1-6 ratio of the OCCL. 
Furthermore, the OCCL was used in terms of 
three degrees. Every characteristic received 
a listing of at least 1. This is the number as- 
signed when the job requires less from the per- 
son than that found in the highest thirty per 
cent of the population (19). No characteristic 
received more than 3. Thus, the ratio for the 
sum of these characteristics from highest to 
lowest cannot exceed the ratio of 1-3. Yet the 
Bellows-Estep ratio is 1-6.2. 
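
The bound argued above can be sketched in a few lines of Python. This is an illustration only, resting on two assumptions: that the highest and lowest jobs are scored on the same number of characteristics, and that every characteristic is scored from 1 to 3. The function name and the counts of characteristics tried are hypothetical.

```python
def max_sum_ratio(n_characteristics, low=1, high=3):
    """Largest possible ratio of the highest to the lowest total score when
    every one of n characteristics is scored between `low` and `high`."""
    return (n_characteristics * high) / (n_characteristics * low)


# Spread of OCCL totals reported in the text: 10 to 62.
occl_low, occl_high = 10, 62
reported_ratio = occl_high / occl_low

# The bound does not depend on how many characteristics are listed.
for n in (5, 10, 20):  # hypothetical counts
    print(n, max_sum_ratio(n))

print("reported ratio:", reported_ratio)
print("exceeds the bound:", reported_ratio > max_sum_ratio(10))
```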

I would say that the OCCL criterion alone 
discloses interesting possibilities. It estab- 
lished a maximum range between porter and 
corporation president of 1-3. It conforms to 
Viteles’ concept of the maximum range of 
human capacities and may, therefore, be more 
valid than the criterion chosen by Bellows and 
Estep against which to match it. 

Otis (17) rejects both the USES Occupational
Characteristics check list and Viteles’
Job Psychograph for purposes of wage ad- 
ministration. He argues that job evaluation 
should be based upon factors which define the 
difficulty of the job, not on factors which de- 
scribe the qualifications necessary for success- 
ful performance. 

This makes as much sense as setting the 
range of sizes in manufacturing clothing with- 
out any regard for the variation in dimensions 
of human wearers of clothes. People bring 
their capacities to their jobs. The most menial 
job, if it must be done, must be credited with 
minimum human capacities. On the other 
hand, the most exalted job cannot demand more 
of a man than the maximum capacities he 
brings to the position. 

Hay (6) believes that Weber’s law is applic- 
able to job evaluation. He feels that evalu- 
ation grades should be in a geometric progres- 
sion of 15 per cent intervals. He then tells us 
that in office jobs (6, p. 3) up to $4,000.00 or 
$5,000.00 per annum the skill scale will have 
about 25 steps with a 15 per cent increasing 
value for each step. Thus the range in capacities
according to Hay is no longer 1-3 but
1:(1.15)^24, or 1.15 to the 24th power. This is
approximately 1:35.

In my opinion, the most significant job 
evaluation plan in use today is the plan govern- 
ing the inequity wage adjustment program of 
the Steel Corporations and the United Steel 
Workers, C.I.O. Paul Edwards (4), who developed
the plan, used a modification of the
factor comparison method to tailor a system to 
the needs of the Steelworkers. The method 
that he has used to root the plan in the existing 
wage structure of the industry calls for the use 
of correlation mathematics beyond the capacity 
of most workers to understand. However, 
the operation of the plan calls for simple rank 
additions and is readily understood by anybody 
with an elementary school knowledge of simple 
addition. The correlation mathematics was 
required to derive the weights for different 
factors resulting from past wage practices and 
collective bargaining experiences. He com- 
ments that (page 161) “The actual rates de- 
veloped by supply and demand and bargaining 
in years past have recognized the nature of the 
problem better than empirical job evaluation 
plans. The principal advantage of the plan 
is that it makes it possible to pursue systematically
in the future the same set of values that
have governed both parties initially in the 
past.” 

The validity of the plan is based upon the 
satisfaction of collective bargaining experi- 
ences. Factor comparison methods of this 
nature lend themselves readily to such flexible 
operation and are therefore superior to point 
plans with their fixed ranges and unpredictable 
variability. I believe that the most fruitful 
research will come from this type of orienta- 
tion. 


Summary 


The trade unionist looks upon job evaluation 
as a subordinate tool in collective bargaining. 
It does not determine what a job is worth; it
determines a limited concept of job content.

The final evaluation rates can only be one 
factor in determining what the relative wage 
structure should be. 

Most job evaluation plans are exceedingly 
defective in measuring job content. Most ab- 
breviated plans perform the same function 
more economically but are equally defective. 

The most useful work in job evaluation is 
research designed to isolate the factors that 
have governed the intuitive operation of col- 
lective bargaining as each party sought its own 
concept of equity. These factors can then be 
used for future guidance. 


Received February 23, 1950. 
Adjusting Base Weights in Job Evaluation


J. Stanley Gray 
University of Georgia 


When a system of job evaluation is validated
by calculating the critical ratio of the difference
between the percentage of salary and the
percentage of evaluation points,¹ it is important
that the ratios of the evaluation points of various
key jobs be the same as the ratios of the
salaries of those same jobs.² If one key job
carries double the salary of another key job, it
is necessary that the evaluation points also be
double. If this ratio is not the same, the
differences between the percentages of salary and
the percentages of evaluation points carried
by the various key jobs will be significant and
the evaluation system invalid. However, by
adding a common base value to the evaluation
points of each job the ratios can often be made
to correspond to those of salaries and the
percentage differences between salary and points
may cease to be significant.

¹ Gray, J. S., Custom made systems of job evaluation.
J. appl. Psychol., 1950, 34, 378-380.

² Two definitions are basic in this paper. Job evaluation
is a determination of the value of each job in
relation to other jobs on the same payroll. Key jobs
are those that now carry proper wage rates in relation
to the size of the total payroll.

For example, key job A in a certain establishment
carries a wage of $60.00 per week and
key job B carries a wage of $102.00 per week.
This is a ratio of 1 to 1.7. When these jobs
were evaluated, job A was given 210 points
and job B 475 points. This is a ratio of 1 to
2.26. These are the two extreme key jobs in
salary (highest and lowest) and also in evaluation
points. Now, so long as these jobs do
not have the same ratios in evaluation point
values as they have in wage values, the differences
between the percentages of points and of
wages will be significant. (See Table 1.)
Before discarding the system of job evaluation
here used for its apparent lack of validity (this
may be necessary later on), it may be possible
to add a common base weight to the values of
both jobs and make the value ratios approximately
equal to the salary ratios. Now, what
value can be added to the point values, 210
and 475, to make their ratio equal to the wage
ratio? This question can be answered by
solving a simple proportion equation. It may
be stated as follows:


60 : 102 :: (210 + X) : (475 + X),
or 102 (210 + X) = 60 (475 + X).


The value of X is approximately 170 (168.6
before rounding). When this value is
added to the point values of each job, their
evaluation points become 380 and 645 respectively.
This ratio is now practically the same as that of
the salaries, namely, 1 to 1.7. This means, of
course, that 170 must be added to the point 
values of all key jobs as well as all others evalu- 
ated, if the system is retained. The base value 
170 now becomes a part of the evaluation 
system. 
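
The proportion can also be solved mechanically. What follows is a minimal Python sketch and not part of the original paper; the function name `base_value` is introduced here only for illustration. Cross-multiplying the proportion gives X = (wage_A x points_B - wage_B x points_A) / (wage_B - wage_A).

```python
from fractions import Fraction


def base_value(wage_low, wage_high, points_low, points_high):
    """Solve wage_low : wage_high :: points_low + X : points_high + X for X.

    Illustrative helper (hypothetical name); it simply rearranges
    wage_high * (points_low + X) = wage_low * (points_high + X).
    """
    numerator = wage_low * points_high - wage_high * points_low
    denominator = wage_high - wage_low
    return Fraction(numerator, denominator)


# Key jobs A and B from the text: $60 and $102 per week, 210 and 475 points.
x = base_value(60, 102, 210, 475)
print(float(x))  # about 168.6, which the text rounds to 170

# Ratio of the adjusted point values after adding the rounded base of 170.
print((475 + 170) / (210 + 170))  # about 1.70, the wage ratio
```

The exact solution is a little under 170; the rounded figure is what the text carries forward as the base value.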

In the illustration just given, note also that 
the exact value of each evaluation point, or the 
point value of each dollar, can be calculated. 
The point value is calculated by dividing 
$60.00 by 380, or $102.00 by 645; and the 
dollar value is calculated by dividing 380 by 
$60.00, or 645 by $102.00. Actually the above 
system was found to be valid and the formula 
for calculating the proper weekly wage for each 
job evaluated was: point value, plus 170, times 
$0.158, or point value, plus 170, divided by 6.3. 
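
The conversion formula just stated can be checked against the two key jobs. The snippet below is an illustrative sketch only; the constant and function names are assumptions made here, while the figures 170 and $0.158 come from the text.

```python
BASE_VALUE = 170           # base added to every job's points (from the text)
DOLLARS_PER_POINT = 0.158  # about $60.00 / 380, as stated in the text


def weekly_wage(points):
    """Weekly wage = (point value + base value) * dollars per point."""
    return (points + BASE_VALUE) * DOLLARS_PER_POINT


for points in (210, 475):  # key jobs A and B
    print(points, round(weekly_wage(points), 2))
```

Both results fall within a few cents of the stated wages of $60.00 and $102.00, the small gap arising from rounding the base value and the dollar value of a point.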

It is not always possible to salvage an ap- 
parently invalid job evaluation system in this 
way. Even when base values added to each 
job value make the point ratio equal that of the 
wage-ratio, the differences between percent- 
ages of wages and percentages of value points 
of many key jobs may yet be significant. This 
may mean either that the system is really not 
valid for that particular situation, i.e., the 
factors and weights do not fit the jobs being 
evaluated, or the key jobs may not be truly 
key jobs. If the key jobs do not carry proper 
wage rates, the procedure of validating a sys- 
tem of job evaluation by calculating the sig- 
nificance of the differences between the wage 
rate percentages and the job value percentages 
of those key jobs is worthless. 
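
The validating calculation can be sketched in Python as follows. This is an illustration under a stated assumption, not the author's documented procedure: the S.D. of the difference is taken to be the usual standard error of the difference between two independent percentages, with total salary and total points serving as the Ns. The function name `critical_ratio` is hypothetical. Applied to the first job of Table 1 ($1,389 of $25,740; 170 of 4,060 points), this assumption gives an S.D. close to the tabled .344.

```python
import math


def critical_ratio(wage, total_wages, points, total_points):
    """Return (difference, S.D. of difference, critical ratio) in per cent units.

    Assumed formula: sqrt(p1*q1/N1 + p2*q2/N2), where p is a percentage,
    q = 100 - p, and the Ns are the payroll total and the point total.
    """
    pct_wage = 100.0 * wage / total_wages
    pct_points = 100.0 * points / total_points
    diff = pct_wage - pct_points
    sd_diff = math.sqrt(
        pct_wage * (100.0 - pct_wage) / total_wages
        + pct_points * (100.0 - pct_points) / total_points
    )
    return diff, sd_diff, diff / sd_diff


diff, sd_diff, cr = critical_ratio(1389, 25740, 170, 4060)
print(round(diff, 1), round(sd_diff, 3), round(cr, 1))
```
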

Table 1 
Critical Ratios of Differences in Percentages of Salaries and Evaluation Points for 14 Clerical Key Jobs 

           Annual     Per Cent of    Evaluation   Per Cent of    Difference    S.D. of   Diff. / 
Jobs       Salary     Total Salary   Points       Total Points   in Per Cent   Diff.     S.D. of Diff. 

           $1,389     5.4            170                         1.2           .344 
            1,320     5.1            150                         1.4           .326 
            1,560     6.1            210                                       .379 
            1,740     6.8            270                                       .422 
            1,740                    270                                       .422 
            1,500                    185 
            2,040                    360 
            1,920                    325                                       .456 
            1,740                    255 
            2,160                    375                                       .485 
            2,100                    360                                       .478 
            2,100                    355                                       .474 
            2,100                    355                                       .474 
            2,340                    410 

Total     $25,740                  4,060 

On the other hand, a system of evaluating 
clerical jobs with apparent low validities, il- 
lustrated in Table 1, was salvaged by adding 
a base value of 186 to each key job value. 
This base value was determined by solving 
the proportion equation: 150 + X:410 + X:: 
$1320.00: $2340.00. This was made up from 
data from the two extreme jobs, 1-C-29 and 
1-C-17. Note in Table 1 that many CRs are 
significant, indicating that the evaluation 
system is not valid in this situation. How- 
ever, in Table 2, after the base value has been 
added to all job values, no difference is sig- 
nificant. The system is now valid for this 
situation. The formula for converting evalu- 
ation points into annual wage rates is: point 


Table 2 

Critical Ratios of Differences in Percentages of Salaries and Evaluation Points Plus Base Value 
of 186 for 14 Key Clerical Jobs 

Per Cent of    Evaluation   Per Cent of    Difference    S.D. of   Diff. / 
Total Salary   Points       Total Points   in Per Cent   Diff.     S.D. of Diff. 

S.D. of Diff., surviving entries (rows not identifiable): .301, .300, .324, .317, .375, .365, .342 

value plus 186 times 4. A reference table of 
values was constructed so that the translation 
of job values into wage rates was not calculated 
for each individual job.³ 


Summary 


When key jobs are evaluated, the evaluation 
points of specified jobs must be in the same 


3 It must be admitted that these tables show only the 
end results after a long process of altering the job 
evaluation system and dropping so-called key jobs 
because they did not carry proper wage rates. The 14 
jobs listed in these tables are those retained from an 
original list of 30. Key jobs are not key jobs unless 
they are representative of the entire group to be evalu- 
ated, and now carry proper wage rates. 


ratio as the wages (or salary values) of those 
jobs, or the system will certainly not be valid. 
If these ratios are not the same, often they may 
be obtained by adding a common base value to 
all evaluation points. This base value may 
be obtained by solving the proportion equa- 
tion: wage of job A is to wage of job B as value 
of job A plus X is to value of job B plus X. It 
must not be assumed, however, that the ad- 
justment of this ratio alone will make a system 
of job evaluation valid. This is just one of 
many factors that may cause a system of job 
evaluation to show low validity. 


Received March 27, 1950. 





Ready Made versus Custom Made Systems of Job Evaluation 


J. Stanley Gray and Marvin C. Jones 
University of Georgia 


A few popular point systems of job evalu- 
ation have been used more widely than was 
ever intended by their original creators. For 
example, the National Electrical Manufactur- 
ers Association system was adopted, not 
adapted, by a wide range of industries without 
validity justification, at least appearing in 
print. In fact, there is no evidence that this 
system was ever justified by a validity study, 
even in the electrical industry for which it was 
originally designed. It was both accepted 
and rejected on “faith,” depending on how it 
appeared to work. Its wide use in situations 
for which it was not intended was often neces- 
sitated by the absence of a more appropriate 
system. The only alternative to the use of 
such a ready made system of job evaluation is 
to construct one to fit the situation in which it 
is to be used. This necessitates time and 
effort and is justified only if the results are 
worth the cost. 

The purpose of this study was to compare 
the worth of a ready made system of job evalu- 
ation with one constructed to fit the peculiari- 
ties of the jobs being evaluated. A ready 
made system of job evaluation¹ was used to 
evaluate fifty jobs in a textile mill. The jobs 
were first carefully analyzed and the duties, 
responsibilities, and conditions were described. 
Using the ready made manual and factor 
weights, the jobs were evaluated by each mem- 
ber of a local committee (with representatives 
from both management and union) and by the 
junior author of this article. All differences 
were resolved in committee discussion and the 
evaluation of each job was justified in writing. 
Key jobs² were then selected and the critical 
ratio of differences between evaluation points 


1 Smyth, R. D., and Murphy, M. J. Job evaluation 
and employee rating. New York: McGraw-Hill Book 
Company, 1946. 

2 Key jobs are those which are typical in a plant and 
about which there is no disagreement concerning wage 
rates. In this case they were chosen by a consulting 
committee composed of representatives of both manage- 
ment and union in the mill being evaluated. There 
were fifteen such jobs. 


and wages for those key jobs was calculated for 
each. These are shown in Table 1.³ While 
none of these ratios was large, they do indicate 
that at least half of the key jobs were not ac- 
curately evaluated. If a key job carries 7.9 
per cent of the total key job payroll, it should 
also carry 7.9 per cent of the total evaluation 
points of all key jobs. When it carries 9.1 per 
cent instead and the difference is found to be 
significant at the one per cent level, the obvious 
conclusion is that that job is not properly 
evaluated. 
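
The 7.9 and 9.1 per cent figures correspond to the Winder Room Section Hand row of Table 1 (hourly wage 1.21 of a 15.27 total; 286 of 3156 points). A brief Python check, offered only as an illustration with a hypothetical helper name:

```python
def per_cent_of_total(value, total):
    """Share of a column total, expressed in per cent."""
    return 100.0 * value / total


wage_share = per_cent_of_total(1.21, 15.27)   # share of key job payroll
point_share = per_cent_of_total(286, 3156)    # share of key job points
print(round(wage_share, 1), round(point_share, 1))
```

A job that is properly evaluated would show two nearly equal shares; here the point share exceeds the wage share by more than a full per cent.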

A check of the distribution of degree values 
for each factor assigned to the fifty jobs, indi- 
cated further that the system did not fit the 
jobs being evaluated. Some of the factors 
were simply “dead weight.” They were not 
distinguishing variables. For example, factors 
1, 2, 7, 8, 10, and 12 were variables only in the 
three lower degrees and factor 9 was not a vari- 
able at all, as shown in Table 2. The sixth 
degree in all factors was unused and the fifth 
degree was used only in one factor and for one 
job. Either the factors were not adequate 
variables, or the degrees were not in proper 
gradations to indicate the actual variability. 
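
A check of this kind is easy to mechanize. The sketch below is illustrative only: the ratings are invented for the example (they are not the study's data), and the function names are assumptions. A factor on which every job receives the same degree distinguishes nothing and is flagged as dead weight.

```python
from collections import Counter


def degree_frequencies(ratings):
    """Count how many jobs were rated at each degree of each factor."""
    return {factor: Counter(degrees) for factor, degrees in ratings.items()}


def dead_weight_factors(ratings):
    """Factors on which all jobs received one and the same degree."""
    return [f for f, degrees in ratings.items() if len(set(degrees)) <= 1]


# Hypothetical degree ratings for five jobs on three factors.
ratings = {
    "Education": [1, 2, 2, 3, 1],
    "Physical Effort": [2, 3, 1, 2, 3],
    "Responsibility for Confidential Data": [1, 1, 1, 1, 1],
}
print(degree_frequencies(ratings)["Education"])
print(dead_weight_factors(ratings))
```
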

A “custom made” system of job evaluation 
was then developed and modified until it did 
fit the key jobs accurately. Only those factors 
were used which appeared in varied amounts 
in the jobs being evaluated. Each factor was 
divided into degrees or steps of gradation actu- 
ally present in the jobs being evaluated. Al- 
though six of the seven factors carried the same 
titles as those in the Smyth-Murphy system, 
the degrees were different in definition and in 
number. The distribution of the 50 jobs evalu- 
ated at each degree of each factor used in the 
custom made system is shown in Table 3. 
Note that there were no dead weight factors 
and all degrees were actually used although 

3 This validating procedure is described by J. S. Gray, 
Custom made systems of job evaluation. J. appl. 
Psychol., 1950, 34, 378-380. 

Table 1 

Critical Ratios of Differences in Percentages of Wages and Points for Key Jobs 
(Smyth and Murphy System) 

                                               (1)      (2)        (3)      (4)        (5)        (6)      (7) 
                                                        Per Cent            Per Cent   Diff. 
Code                                           Hourly   of Total            of Total   in Per     S.D.     Diff. / 
Number     Job                                 Wage     Wage       Points   Points     Cents      of Diff. S.D. 

6-19.166   Doffer (Spinning Room)               .85     5.6        160      5.1         .5        .40 
6-19.117   Manual Winder Hand                   .87     5.7        170                            .41 
6-19.113   Auto Quiller Hand                    .88     5.8        151      4.8        1.0        .41 
6-19.022   Frame Operator (Super Draft)         .91     6.0        168      5.3 
6-19.533   Cloth Inspector                      .92                188      6.0 
8-19.01    Blending Machine Operator            .92                212      6.7 
6-19.031   Card Hand                            .94                182      5.8 
4-15.020   Weaver                               .96                205      6.5 
6-19.041   Spinner                              .98                181      5.7 
6-19.226   Slasher Hand                        1.06                237      7.5 
6-19.827   Tying-in Machine Operator                               220      7.0 
5-83.641   Winder Room Section Hand            1.21                286      9.1 
5-83.324   Spinning Room Section Hand          1.21     7.9                 8.6 
6-18.220   Card Grinder                        1.21     7.9        265      8.4 
4-16.010   Loom Fixer                          1.25     8.2        260      8.2 

           Total                              15.27   100.0       3156    100.0 


they differed in number with each factor. One 
factor had six degrees, two had five degrees, 
and four had four degrees. Variables and de- 
grees of variability were designed to fit the 
jobs being evaluated. Factors and degrees 
were then assigned weights or point values (a 
trial and error procedure) which provided an 
accurate evaluation of key jobs. The success 
of this step is shown in Table 4. None of the 
differences even approach statistical signifi- 


Table 2 
Frequency of Jobs Rated at Each Degree of Each Factor for Smyth and Murphy System 

Factor                                          Degrees 

 1. Education 
 2. Experience 
 3. Initiative 
 4. Physical Effort 
 5. Mental Effort 
 6. Visual Attention 
 7. Responsibility for Tools 
 8. Responsibility for Materials 
 9. Responsibility for Confidential Data 
10. Responsibility for Reports and Records 
11. Working Conditions 
12. Unavoidable Hazards 

Table 3 
Frequency of Jobs Rated at Each Degree of Each Factor for Custom Made System 

                                        Degrees 
Factor                                  1st    2nd    3rd    4th    5th    6th 

1.                                      10 
2.                                      22 
3. Physical Effort                       1 
4. Visual Demand                         1 
5. Motor Ability                         9 
6. Working Conditions                    3 
7. Responsibility for Materials         27 

cance, indicating that the key jobs were ac- 
curately evaluated. 

A manual of directions was then written in 
which each factor and each degree under each 
factor was carefully described. The evalu- 
ations of the key jobs were written in as guides 
for evaluating other jobs. The evaluation 
committee discussed the definitions in the 
manual and suggested changes. No effort was 
spared in making the manual a useful handbook 
in evaluating other jobs. 

The final step was the evaluation of all jobs 
in the mill by each member of the committee 
and by Mr. Jones. As in the use of the Smyth- 
Murphy system, all differences were resolved 
in committee discussion and the final evalu- 
ation of each job was written up on a justifica- 
tion sheet. The correlation of the two evalu- 


Table 4 

Critical Ratios of Differences in Percentages of Wages and Points for Key Jobs 
(Custom Made System) 

                                               (1)      (2)        (3)      (4)        (5)        (6)      (7) 
                                                        Per Cent            Per Cent   Diff. 
Code                                           Hourly   of Total            of Total   in Per     S.D.     Diff. / 
Number     Job                                 Wage     Wage       Points   Points     Cents      of Diff. S.D. 

6-19.166   Doffer (Spinning Room)               .85     5.6        125      5.6         -          - 
6-19.117   Manual Winder Hand                   .87                123      5.5                   .49      .41 
6-19.113   Auto Quiller Hand                    .88     5.8        134      6.0                   .49      .41 
6-19.022   Frame Operator (Super Draft)         .91     6.0                 6.1 
6-19.533   Cloth Inspector                      .92 
8-19.01    Blending Machine Operator            .92 
6-19.031   Card Hand                            .94 
4-15.020   Weaver                               .96 
6-19.041   Spinner                              .98 
6-19.226   Slasher Hand                        1.06 
6-19.827   Tying-in Machine Operator           1.10 
5-83.641   Winder Room Section Hand            1.21     7.9 
5-83.324   Spinning Room Section Hand          1.21     7.9 
6-18.220   Card Grinder                        1.21     7.9 
4-16.010   Loom Fixer                          1.25     8.2 

           Total                              15.27   100.0 

ations for the fifty jobs was .90 ± .018, which 
has but little meaning until other similar 
studies have been made. 

Now, taking the custom made system as the 
criterion (justified by the smaller CR’s in 
Table 4 than in Table 1) a check was made of 


Table 5 

Number of Jobs Evaluated Differently by Two Systems 
of Job Evaluation in Units of One-tenth Sigma 

Sigma Value                     Sigma Value 
Differences     No. of Jobs     Differences     No. of Jobs 

0.0             3 
 .2             12 

                                Total           50 



the number of jobs misevaluated by the 
Smyth-Murphy system. These are shown in 
Table 5. Note that only three jobs were 
evaluated exactly the same (in sigma values) 
by both systems, and nineteen jobs were mis- 
evaluated by the Smyth-Murphy system by 
one-half sigma value or more. When these 
errors are translated into wage values at textile 
mill levels, they are more significant to the 
workmen concerned than any statistical treat- 
ment would indicate. The living standards of 
whole families are affected by pay differences 
indicated by a half sigma value. Though con- 
siderable effort is involved, the greater ac- 
curacy would seem to justify construction of a 
system of job evaluation to fit the jobs being 
evaluated, rather than using a ready made 
system that misevaluates one-third of the jobs 
as much as one-half a sigma value. 
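
The comparison in sigma values can be sketched as follows. This is an illustration under assumptions, not the authors' worksheet: each system's point totals are converted to standard (sigma) scores using the population standard deviation, and jobs whose two sigma scores differ by at least the threshold are counted. The function names are hypothetical.

```python
from statistics import mean, pstdev


def sigma_values(points):
    """Convert raw evaluation points to sigma (standard) scores."""
    m = mean(points)
    s = pstdev(points)  # assumption: population S.D.; the paper does not say
    return [(p - m) / s for p in points]


def count_large_differences(system_a, system_b, threshold=0.5):
    """Number of jobs whose sigma scores differ by at least `threshold`."""
    za = sigma_values(system_a)
    zb = sigma_values(system_b)
    return sum(1 for a, b in zip(za, zb) if abs(a - b) >= threshold)


# Two systems that rank four hypothetical jobs in opposite order.
print(count_large_differences([1, 2, 3, 4], [4, 3, 2, 1]))
```

Because sigma scores are unaffected by a change of scale, two systems with different point ranges can be compared job by job in this way.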


Received April 15, 1950. 





Predicting Long-Range Performance of Substation Operators 


G. M. Worbois 
The Detroit Edison Company 


Selection tests are frequently validated on a 
criterion of success covering the first few 
months or years of employment. The tests 
may be shown to provide a reliable prediction 
of time to reach a production rate, progress in 
training for the job, or supervisors’ ratings of 
success. To allow for chance errors, the re- 
sults are checked on another, independent 
group. 

This leaves another problem which is some- 
times bothersome: Do the tests also predict 
performance over the entire employment pe- 
riod? When the initial criterion is highly re- 
lated to the criterion of long-range perform- 
ance, it would be expected. However, the 
initial criterion is not always highly related to 
the long-range criterion. For example, if the 
initial criterion is something like “learning the 
job,” and the long-range criterion is something 
like “dependability,” there may be little rela- 
tionship between them. 


The tests may be very useful in predicting 


the initial criterion. Perhaps that is all that 
should be expected from them. On the other 
hand, if the tests predict both initial and long- 
range performance, the contribution of the 
tests in building an efficient personnel organi- 
zation will be greater. 

There are also practical implications. When 
a battery of tests is constructed to predict an 
initial criterion, there is a tendency for “man- 
agement” to generalize, assuming that the tests 
predict performance over the entire period of 
employment. The tests may or may not do 
this. Perhaps the psychologist has a responsi- 
bility to determine whether the test batteries 
predict long-range performance as well as the 
initial criterion. At least, the tests which pre- 
dict the initial criterion should not be nega- 
tively related to the long-range criterion. 


The Problem 


In the present study, a battery of tests is 
compared to both an initial criterion and a 
long-range criterion. The tests and the job 
for which they were developed have been de- 


scribed by Viteles (6, pp. 260-273). After the 
tests had been validated and installed at the 
Philadelphia Electric Company, they were re- 
validated and installed at The Detroit Edison 
Company under the direction of Viteles (7, 4). 
Several criteria were used for comparison with 
test results. The one which showed best 
agreement with the tests, according to one 
report (4), consisted mainly of progress in 
learning the job.¹ For the present study this 
criterion seemed to be the best index of initial 
success covering the first few years of employ- 
ment. This is called the initial criterion. 

About 63 per cent of the men who partici- 
pated in the original validation (1929) remained 
on the job to be rated again by supervisors in 
1948. These ratings consisted of general abil- 
ity demonstrated over many years of service, 
and are called the long-range criterion.² 

There was not much agreement between the 
1929, or initial, criterion scores and the 1948 
ratings, or long-range criterion since the cor- 
relation was only .33. This raised the problem 
reported here: Do certain test standards de- 
veloped on the initial criterion for this situation 
show reliable agreement with the long-range 
criterion? 

Procedures 


Scores on 10 tests were available for the total 
group of 119 employees taking the tests in 
1929. A battery of five tests was selected by 
the Wherry-Doolittle method (3) to give the 
best prediction of the initial criterion. Battery 
scores were computed for the 75 men remaining 
on the job.³ These scores were then compared 
with the long-range criterion. 


1 The criterion consisted of the sum of (1) grades in 
the departmental training school and (2) grades for 
examinations covering practical operating problems (4). 
The standard deviation for the former was 6.9, while 
for the latter it was 2.8. The correlation between the 
school grades and the examination grades was .46. 

2 As judged by agreement between several super- 
visors rating independently, the ratings were satisfac- 
torily reliable. (Correlations above .85.) 

3 There were no significant differences in means or 
standard deviations between the test distributions for 
these 75 and the entire group of 119. 

Table 1 

Test Weights and Multiple R by the Wherry- 
Doolittle Method, N = 119 

Tests                  Beta Weight     R 

Location               .2289           .362 
Series Completion      .2151 
Blocks A               .1100           .449 
Blocks B               .1226           .458 
Directions             .1159           .464 


As another means of combining the tests, the 
multiple cutting-score method as described by 
Grimsley (1) was used. Three tests were se- 
lected by this method to give the best predic- 
tion of the initial criterion. Comparisons of 
scores on this battery with the long-range cri- 
terion were made for the 75 men remaining on 
the job. 
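
The practical difference between the two ways of combining tests can be shown with a short sketch. It is illustrative only: the weights, cutoffs, and scores below are invented, and the function names are assumptions. A weighted battery lets a high score on one test offset a low score on another, whereas multiple cutting scores require each test to reach its own minimum.

```python
def composite_score(scores, weights):
    """Weighted battery score: sum of weight * standard score over the tests."""
    return sum(weights[test] * scores[test] for test in weights)


def passes_cutoffs(scores, cutoffs):
    """Multiple cutting-score rule: every test must meet its own cutoff."""
    return all(scores[test] >= cutoff for test, cutoff in cutoffs.items())


# Hypothetical weights, cutoffs, and standard scores for one applicant.
weights = {"Location": 0.23, "Directions": 0.12, "Series Completion": 0.22}
cutoffs = {"Location": 0.0, "Directions": 0.0, "Series Completion": 0.0}
applicant = {"Location": 2.0, "Directions": -1.0, "Series Completion": 1.0}

print(round(composite_score(applicant, weights), 2))  # high composite
print(passes_cutoffs(applicant, cutoffs))             # fails on Directions
```
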

Results 


For the first method of developing the test 
battery (Wherry-Doolittle) the rule of “adding 
tests as long as there is an increase in R” was 
routinely followed. In Table 1 are shown the 
R’s as each test is added and the beta weights 
for each test. 

For the second method of developing the test 
battery, the procedures of Grimsley (1) were 
followed.⁴ Table 2 shows the mean criterion 
scores and the per cent passing each standard 
for each of the criterion groups. From the 


4 The men were separated into three groups: the 27 
per cent rated highest; the 27 per cent rated lowest, 
with the remainder as the average group in terms of the 
initial criterion. 


trial batteries in this table it appeared that the 
best combination of tests would be: Location + 
Directions + Series Completion. 

Table 3 shows the mean criterion scores for 
persons selected by different levels on the tests. 
The A level of standards on the tests is highest, 
and C is the lowest. The men who failed to 
meet the C standards on the tests are placed in 
the FF group. The criterion scores show how 
the men who passed different levels on the tests 
were rated. 

In columns 4 and 5 are shown the mean ini- 
tial criterion scores for those who passed the 
various levels on the tests. Thus, the mean 
initial criterion score for the 11 highest men on 
the W-D battery was 12.4. The mean initial 
criterion score for the 11 men selected by Level 
A standards on the MC-S battery was 11.4. 
It can be seen that for both methods the men 
passing higher standards on the tests had 
higher initial criterion scores. 

In columns 6 and 7 are shown the mean long- 
range criterion scores for the same men, and, of 
course, the same test standards. While there 
is some shrinkage for both methods, those pass- 
ing higher test standards have higher criterion 
scores. In columns 8 through 11 are shown 
the percentages of the lowest rated men who 
passed the various test standards. It can be 
noted that, in this group, either method of de- 
veloping the test battery on the 1929 data 
would have screened about three out of four 
men who over the years turned out to be poor- 
est on the job (as judged by the 1948 ratings). 
This is true if the lowest standards on the test 


Table 2 

Multiple Cutting-Score Method: Mean Criterion Scores and Percentages Selected by the 
Trial Batteries at Each Critical-Score Level, N = 119 

Trial batteries, as listed: 
    Location + Blocks B 
    Location + Directions 
    Location + Directions + Series Completion 
    Location + Directions + Series Completion + Blocks A 
    Location + Directions + Series Completion + Pursuit 

Critical-Score Levels, surviving columns in printed order: 
    (3) %:           26.9   17.6   15.1   13.4   10.9    8.4 
    (4) M, Level B:   9.8   10.1    9.8   10.1   10.3   12.5 
    (5) %, Level B:  59.7   37.8   53.8   43.7   31.9   27.7 

Table 3 

Comparison of Groups Selected at Different Levels by the Two Methods on Both the Initial 
Criterion and the Long-Range Criterion, N = 75 

                                                          Per Cent of Lowest Rated Men 
                                                          Passing Test Standard 
                 Mean Initial        Mean Long-Range      Initial Criterion 
                 Criterion           Criterion            N = 20 
Level   Cases    W-D*     MC-S**     W-D*     MC-S**      W-D*    MC-S**     W-D*    MC-S** 
                 (4)      (5)        (6)      (7)         (8)     (9)        (10)    (11) 

A       11       12.4     11.4       11.2     10.3         5%     10%        10%      8% 
B       24       11.5     11.0       10.7     10.2        15%     15%        18%     20% 
C       35       10.7     10.0       10.1      9.9        25%     40%        25%     30% 
FF      40        8.0      8.6        8.4      8.6 

* W-D: Wherry-Doolittle Battery. Initial Criterion: M = 9.27, S.D. = 3.07 
** MC-S: Multiple Cutting-Score Battery. Long-Range Criterion: M = 9.21, S.D. = 3.01 

Explanation: Level A (as determined by the multiple cutting-score (MC-S) method) selected 11 men. Of 
these, 7 were also among the top 11 in scores on the Wherry-Doolittle (W-D) battery. Level B selected 24 men 
(11 of whom also passed level A). Of these, 17 were also among the top 24 on the W-D battery. Level C selected 
35 men. The FF group includes those who failed the C level. 


had been used. If higher standards had been 
used, a larger percentage of the poor men 
would have been screened. 

In Table 4 are shown the t values for differ- 
ences between mean criterion scores of those 
who pass and those who fail the various test 
standards. The B level test standards show 
the best discrimination between those who pass 
and those who fail. The t values in column 4 
show that any of the test standards set by the 
W-D method on the basis of the initial criterion 
would also have been reliable in predicting the 
long-range criterion. The t values in the last 
column indicate that the MC-S standards de- 
veloped on the initial criterion would not have 
provided as reliable prediction of the long-range 
criterion. 

It should be noted that the results reported 
by Grimsley (1) showed as good predictive 
value for the MC-S battery as for the W-D 
battery. However, several conditions are 
different in the two studies. In fact, the main 
similarity is that the test batteries were de- 
veloped by the same methods. His study ap- 
plied the test standards to the same criterion 
for a different group. The present study ap- 
plied the test standards to a different criterion 
for the same group. In following sections the 
test standards are applied to a different criterion 
for a different group. 





Table 4 

Significance of Differences (t) of Criterion Scores for Those Passing and Those Failing Various 
Standards by the Two Methods, N = 75 

                     Initial Criterion        Long-Range Criterion 
Level of             W-D        MC-S          W-D        MC-S 
Test Standard        t          t             t          t 
(1)                  (2)        (3)           (4)        (5) 

A                    3.94       2.52          2.34       1.13 
B                    4.80       3.69          2.90       1.92 
C                    4.36       1.87          2.88       1.84 

t required for significance: 5% level = 2.0; 1% level = 2.7. 




Table 5 

Significance of Differences (t) of Criterion Scores for Those Passing and Those Failing Various 
Standards by the Two Methods, N = 98 
(Employed before 10-1-1929) 

                               Long-Range Criterion 
Level of                       W-D        MC-S         Number of Cases 
Test Standard        N         t          t            in Common 
(1)                  (2)       (3)        (4)          (5) 

A                    17                                 9 
B                    43                                32 
C                    58                                48 

t required for significance: 5% level = 2.0; 1% level = 2.7. 


Results for Other Groups. Test standards 
developed on the initial criterion for the above 
group may also be applied against the long- 
range criterion for other groups. The next 
group to which they were applied consisted of 
substation operators who had been employees 
of the Company in 1929 but were not given the 
tests during the original study. After the tests 
had been validated, they were routinely given 
to all of the substation operators. Most of 
these men were not included in the original 
study because they had been operators for less 
than a year, or because they were working in 
“automatic” rather than “manual” substa- 
tions. None of them was tested before em- 
ployment. There were 98 men who were 
given the tests shortly after 1929, and who 
were rated in 1948. 

The long-range criterion scores were com- 
pared for those who passed and those who 
failed the test standards. The significance of 
the differences is shown in Table 5. It can 
be seen that those who passed the tests at the 
A standard (highest) had significantly higher 








criterion scores. This was true for both 
methods of developing test standards. Those 
who passed the C standard on the MC-S bat- 
tery had significantly higher criterion scores. 
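
The kind of comparison summarized by these t values can be sketched in Python. The paper does not state which form of t was computed, so the pooled-variance form below is an assumption, the function name is hypothetical, and the criterion scores are invented for the example.

```python
import math
from statistics import mean, variance


def pooled_t(passed, failed):
    """Student's t for the difference between two independent group means,
    using a pooled estimate of variance (an assumed form of the test)."""
    n1, n2 = len(passed), len(failed)
    pooled_var = ((n1 - 1) * variance(passed) + (n2 - 1) * variance(failed)) / (
        n1 + n2 - 2
    )
    standard_error = math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return (mean(passed) - mean(failed)) / standard_error


# Hypothetical long-range criterion scores for men passing and failing a standard.
passed = [11, 12, 10, 13]
failed = [8, 9, 7, 8]
t = pooled_t(passed, failed)
print(round(t, 2), t > 2.0)  # compare with the 5% level quoted under the tables
```
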

Another group to which the test standards 
may be applied is: those substation operators 
who were employed after the original study in 
1929. These people were employed partly on 
the basis of their test scores. Accordingly, in 
this group there is a smaller percentage of low 
test scores. Relationships with the criterion, 
accordingly, would be expected to be attenu- 
ated. 

There were 106 of these substation operators 
who had worked at least five years as operators, 
and who were rated in 1948. Their average 
length of service was 13.5 years. 

The long-range criterion scores were com- 
pared for those who passed and those who 
failed the test standards. The significance of 
the differences is shown in Table 6. It can 
be observed that at the B and C level there is 
a significant difference in long-range criterion 
scores using the W-D battery. At the A level 


Table 6 

Significance of Differences (t) of Criterion Scores for Those Passing and Those Failing Various 
Test Standards by the Two Methods, N = 106 
(Employed after 10-1-1929) 

                               Long-Range Criterion 
Level of                       W-D        MC-S         Number of Cases 
Test Standard        N         t          t            in Common 
(1)                  (2)       (3)        (4)          (5) 

A                    24        0.54       0.76         13 
B                    64        2.94       0.00         47 
C                    80        2.86       0.49         70 

t required for significance: 5% level = 2.0; 1% level = 2.7. 

for the W-D battery and at each of the levels 
for the MC-S battery, the differences are not 
significant. 


Discussion 


Test standards were developed on the initial 
criterion for one group and then were applied to 
a long-range criterion for (1) the same group, 
(2) a different group employed without benefit 
of the tests, and (3) a group selected partly on 
the basis of their test scores. This kind of ap- 
plication involves more than statistical re- 
gression found in applying test standards from 
one group to another, comparable group. 
Each group in this study is different. The 
second group is different from the first in that 
they were employed at a different time and had 
been assigned to different kinds of substation 
operating jobs. The third group is different 
from the first in that they were employed 
mainly in the depression years and low test 
scores were eliminated in the hiring process. 

Furthermore, the criterion is different from 
that by which the tests were developed. The 
tests designed to predict the initial criterion 
were not necessarily designed to predict the 
long-range criterion. The criteria themselves 
are not closely related. 

It might be surprising, therefore, to find 
much of any relationship between test scores 
and the long-range criterion for any of these 
groups. Any positive relationship found, how- 
ever, is of practical concern. It would indicate 
predictive value for the tests beyond that for 
which they were originally designed. From 
the employment standpoint, it would show the 
value of the tests in selecting employees who 
are likely to be successful over many years of 
service—as well as during the first few years of 
learning the job. 

The results for each group are not always 
consistent. For some groups a high standard 
appears preferable, for others a low standard 
gives better discrimination. For some groups 
one method of developing test standards ap- 
pears better than another. These variations 
of results from group to group indicate that 
validation of tests should be based on groups 
similar to those for which they are going to be 
used. As the employment market, nature of 
the job, etc., produce changes in the group, the 
influence of those changes may affect the pre- 
dictive power of the tests. This indicates the 


necessity for continuous evaluation of the test- 
ing procedure. 

The results also indicate that any one method 
of developing test standards may not always be 
the best one. Depending on the nature of the 
relationships between test scores, the selection 
ratio, and a number of other factors, one 
method may be preferable to another. As long 
as the relationships between test batteries are 
not perfect, there is a reasonable assumption 
that one may be better than another in a 
specific case.6 Even though the advantage of 
one may be small, if it provides a real increase in 
predictive value, its use may be justified in 
better selection. 

Summary 


Test batteries were developed for predicting 
the success of substation operators over the 
first few years of service. The test standards 
were then applied to a long-range criterion. 

The results indicate that test standards pre- 
dict performance at this job over many years 
of service, but that the same standard or the 
same method of developing the standard is not 
always best. 

Need for adjusting test standards as the 
group for which they are used changes is indi- 
cated. 


Test Batteries for Trainees in Auto Mechanics and Apparel Design 


Glenn C. Martin 
Santa Monica City College 


Instructors at the Santa Monica Technical 
School are charged with the responsibility of 
accepting or rejecting candidates for training, 
with little but cursory information on work ex- 
perience and subjective appraisal to guide such 
decisions. This report describes the early 
stages of an effort to supply test data as partial 
but objective grounds for acceptance or rejec- 
tion in two areas—auto mechanics and apparel 
design. 

The instructors set up criteria of success in 
their courses and ranked their students on the 
criteria by order of merit. 

For auto mechanics the following items were 
considered: 


1. Manipulative Skill, ability to handle tools 
and materials. 

2. Shop Achievement, ability to carry on 
and complete shop assignments. 

3. Shop Habits, ability to do shop tasks in 
an orderly and efficient manner. 

4. Classroom Work, ability to comprehend, 
analyze, and solve theoretical problems, in- 
cluding use of needed mathematics. 

5. Personal Grooming, cleanliness and neat- 
ness of personal appearance. 

6. Leadership, ability to assume responsi- 
bility in a group situation. 

7. Reliability, ability to use initiative in 
solving mechanical problems. 

8. Persistence, patience and willingness to 
persevere independently of success. 

9. Ethics, honesty in the matter of returning 
borrowed tools, taking responsibility for ma- 
terials, breakage etc. 


In order to find the degree of independence 
of these judgments, intercorrelations among 
the ratings were calculated by the rank-differ- 
ence method. The coefficients varied from 
.45 to .89, suggesting that there is either con- 
siderable overlapping of the criteria, or there is 
some halo-effect in applying them to students, 
or both. Factor analysis of the matrix would 
probably reveal common axes, but until that is 
completed there is no obvious reason to com- 
bine any of these criteria. 
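
The rank-difference method referred to above is Spearman's rho. A minimal sketch of the untied-rank formula follows; the function name and the example ranks are illustrative assumptions, not data from the study.

```python
def rank_difference_rho(ranks_a, ranks_b):
    """Spearman's rho for two untied rank orders of the same students."""
    if len(ranks_a) != len(ranks_b):
        raise ValueError("both rankings must cover the same students")
    n = len(ranks_a)
    if n < 2:
        raise ValueError("need at least two students")
    # Sum of squared differences between each student's two ranks.
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))


# Hypothetical ranks of five students on two criteria.
manipulative_skill = [1, 2, 3, 4, 5]
shop_habits = [2, 1, 3, 5, 4]
print(round(rank_difference_rho(manipulative_skill, shop_habits), 2))
```

With tied ranks the simple formula is only approximate, so a production version would need a tie correction.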

For apparel design, students were rated on 
the following items: 


1. Reasoning, the ability to solve problems 
and plan work effectively. 

2. Mechanical Skill, ability to use materials 
and tools accurately and quickly. 

3. Social Skill, ability to adjust to group 
situations, ‘get along’ with fellows and au- 
thority. 


Intercorrelations among these varied from 
.91 to .96, making it clear that the instructor 
had not made the three criteria operation- 
ally distinguishable. Consequently, the three 
rankings for each student were summed, the 
sums again ranked to compose a consolidated 
scale. This result was used as the criterion 
for test correlations. Each original ranking 
correlated .95 or higher with the consolidated 
scale. 
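
The consolidation step (sum the three rankings, then rank the sums) might be sketched as follows. The handling of tied sums by average rank is an assumption, since the article does not say how ties were resolved, and the example ranks are hypothetical.

```python
def ranks_from_scores(scores):
    """Rank 1 = smallest score; tied scores share the average of their positions."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next score equals the current one.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based positions in the run
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks


def consolidated_scale(*rankings):
    """Sum each student's ranks across criteria, then rank the sums."""
    sums = [sum(student) for student in zip(*rankings)]
    return ranks_from_scores(sums)


# Hypothetical ranks of four students on the three apparel design criteria.
reasoning = [1, 2, 3, 4]
mechanical_skill = [2, 1, 3, 4]
social_skill = [1, 3, 2, 4]
print(consolidated_scale(reasoning, mechanical_skill, social_skill))
```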

The following tests were used in the study: 
(A) Guilford-Zimmerman Aptitude Survey, 
Part 7; (B) Bennett Test of Mechanical Com- 
prehension, Form AA; (C) Prognostic Test of 
Mechanical Abilities; (D) Survey of Mechani- 
cal Insight; (E) Macquarrie Test for Mechani- 
cal Ability; (F) Color Discrimination; (G) 
Tests in Fundamental Abilities of Visual Art; 
(H) California Test of Mental Maturity; (I) 
Kuder Preference Record, Form BB; (J) 
Martin-Kahn Selective Triad, a personality 
test. Tests (A), (B), (C), (D) were given to 
the auto mechanics group; tests (E), (F), (G) 
were given to the apparel design students; and 
tests (H), (I), (J) to both groups. The follow- 
ing parts of these tests proved useful, and are 
identified by number in Table 1: (A) 1. me- 
chanical information, (C) 2. spatial relations, 
and 3. blueprint reading, (D) 4. mechanical 
relations, (I) 5. mechanical interests, (J) 6. 
social perspective, (E) 7. dotting, 8. location, 
9. pursuit, and (G) 10. proportion. 

A glaring gap in the exploratory battery is 
the absence of a dexterity test. However, this 
may not be too serious, if the assertion of a 
qualified representative of the United States 
Employment Service is correct. St. Clair (3) 
states that no job analyzed by that organiza- 
tion requires more manipulative dexterity, as 
measured by finger dexterity and peg boards, 
than is possessed by 50% of all employed 
workers. 


Table 1 

Correlations of Tests with Instructors’ Criteria 

Columns: Criterion; Tests; Zero-Order Rho’s with Criterion; Corrected R 

Auto Mechanics 
N = 45 
Manipulative 43 
Skill 39 

Shop 32 
Achievement 31 
37 

Shop Habits 35 
37 
43 

58 
45 
54 
44 

Personal 36 
Grooming 39 

Leadership 31 
A4 
36 

Persistence 33 
52 
44 

51 
Al 

Apparel Design 
N = 25 
Consolidated 76 
A7 

Consolidated 56 
‘ .76 
1 53 

* Indicates significance beyond the 5% level of con- 
fidence. 

** Indicates significance beyond the 1% level of con- 
fidence. 

Correlations between the above criteria and 
test variables were calculated by the rank- 
difference method. The resulting coefficients 
were tested for significance against the null 
hypothesis. 

Those correlations found to be above the 5% 
level of confidence were used to calculate beta 
weights and multiple R’s. If the number of 
significant predictors did not exceed two, the 
formula found in McNemar (2, p. 148) was 
used. When the number of significant pre- 
dictors exceeded two, a condensed version of 
the Doolittle method was employed, described 
by Thorndike (4, pp. 336-339). 
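
For the two-predictor case, the standard textbook expressions for beta weights and multiple R can be sketched as below. This is an illustration under assumptions: it may differ in form from the formula the author cites, it treats the rank-difference coefficients as if they were product-moment correlations, and the input values are hypothetical.

```python
import math


def two_predictor_regression(r_y1, r_y2, r_12):
    """Standardized beta weights and multiple R for two predictors.

    r_y1, r_y2: correlations of each test with the criterion.
    r_12: correlation between the two tests.
    """
    denom = 1 - r_12 ** 2
    if denom <= 0:
        raise ValueError("the two predictors must not be perfectly correlated")
    beta1 = (r_y1 - r_y2 * r_12) / denom
    beta2 = (r_y2 - r_y1 * r_12) / denom
    # R squared is the sum of beta * validity over the predictors.
    r_squared = beta1 * r_y1 + beta2 * r_y2
    if r_squared < 0:
        raise ValueError("inconsistent correlations")
    return beta1, beta2, math.sqrt(r_squared)


# Hypothetical correlations, not values from Table 1.
beta1, beta2, multiple_r = two_predictor_regression(0.40, 0.30, 0.20)
print(round(beta1, 3), round(beta2, 3), round(multiple_r, 3))
```

Each product beta times validity is that test's share of the criterion variance accounted for, which is one way a per-test variance threshold could be checked.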

For the sake of economy in the test battery, 
no test was retained unless it accounted for at 
least 5% of the variance in the criterion. 
Sometimes this resulted in lowering slightly the 
multiple R’s, but in other instances it actually 
raised them somewhat. This appears to result 
from complexity in the factor patterns such 
that negative relationships were obscured and 
so did not appear as negative rho’s, although 
their removal tended at times to raise the re- 
sultant prediction. Also, because of the ran- 
dom errors of measurement among the tests, 
the varying number of predictors, and small N, 
the obtained R’s are somewhat and differently 
inflated. They are corrected by the shrinkage 
formula in Garrett (1, p. 451). 
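
A shrinkage correction of the kind described can be sketched as follows. The exact formula in the cited text is not reproduced in this article; the version below is the commonly used adjustment, adopted here as an assumption, and the example values are hypothetical.

```python
import math


def shrunken_r(r, n_cases, n_predictors):
    """Multiple R corrected for the number of predictors and the sample size.

    Assumed form: R_adj^2 = 1 - (1 - R^2) * (N - 1) / (N - k - 1).
    """
    df = n_cases - n_predictors - 1
    if df <= 0:
        raise ValueError("need more cases than predictors plus one")
    adjusted = 1 - (1 - r ** 2) * (n_cases - 1) / df
    # A negative adjusted value means no dependable prediction remains.
    return math.sqrt(adjusted) if adjusted > 0 else 0.0


# Hypothetical: R = .50 from two predictors on 45 students.
print(round(shrunken_r(0.50, 45, 2), 3))
```

The correction grows as the number of predictors approaches the number of cases, which is why small samples with several tests shrink the most.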


The data in Table 1 show the instructors’ 
criteria for which tests were found with sig- 
nificantly high correlations, the tests as num- 
bered above, zero-order rho’s of tests with 
criteria, corrected R’s, with significance data. 


Summary 


We have found tests which correlate sig- 
nificantly with instructors’ criteria in two 
training courses for these student samples. 
The multiple R’s improve the predictions over 
those based on individual tests, although it 
must be expected that beta weights will fluctu- 
ate with other population samples. One factor 
yet to be determined is, how much do the cor- 
relations depend on training already received 
in the course? Since these student groups 
were, in the great majority, at about the same 
point of progress, this truncation obscures a 
very important possible relationship. To en- 
lighten this, a further procedure is now in 
progress, namely, to test applicants for training 
without changing selection methods, to verify 
predictions based on testing this new genera- 
tion of students, and only then to institute se- 
lection based on testing within the limits of 
accuracy of those predictions. 


2. McNemar, Q. Psychological statistics. New York: 
Wiley, 1949. 
3. St. Clair, W. California State Employment Service, 
4. Thorndike, R. L. Personnel selection: test and meas- 


Validity of Tests for Auto Mechanics 


Edwin E. Ghiselli and Clarence W. Brown 
University of California, Berkeley 


The number of investigations concerned with 
the effectiveness of tests in the selection of 
skilled workers is by no means extensive, par- 
ticularly investigations concerned with the 
validation of tests against job proficiency 
rather than industrial training criteria (1). 
To some extent this situation is understandable 
since for certain of the crafts selection is not a 
major problem inasmuch as only those indi- 
viduals who have completed formal apprentice 
training are considered for employment. 
There remains a large number of skilled occu- 
pations, however, for which either a limited 
preparation by way of trade training is suffici- 
ent or for which the training is accomplished 
on the job itself. Among these latter occupa- 
tions is that of automotive mechanic. While 
practices vary in different areas and with dif- 
ferent organizations with respect to amount of 
previous training required, in this trade there 
exists very clearly the problem of the selection 
of novices. Since automotive maintenance 
and repair is attractive to a large number of 
young men, information concerning devices 
effective in the appraisal of aptitude for this 
occupation would be helpful for purposes of 
vocational guidance and for the planning of 
industrial selection programs. 

A review of the literature revealed only three 
sources which were concerned with the effec- 
tiveness of tests in the selection of automotive 
mechanics (2, 3, 4). In all instances test 
scores were referenced to grades in courses of 
training in automotive maintenance and repair 
and none was referenced to measures of job 
proficiency. A summary of the findings of 
these investigations, together with those of 
several unpublished reports furnished to the 
authors, is given in Table 1. In effecting this 
summary, tests were classified into types ac- 
cording to the system used by one of us else- 
where (1). The first column gives the weighted 
mean of the validity coefficients computed 
through Fisher’s z transformation procedure. 
The second column gives the total number of 
individuals entering into each average, and the 
third column contains the number of validity 
coefficients reported for each type of test. 

From Table 1 it is apparent that intelligence, 
arithmetic, spatial relations, and mechanical 
principles tests give the best predictions of suc- 
cess in training courses concerned with auto- 
motive mechanics. Finger dexterity tests 
show slight promise, but for the remainder of 
the types of tests the coefficients either are too 
low or the data too few to justify satisfactory 
generalizations. 

It is questionable, of course, whether those 
tests found most useful in predicting success in 
training also will be the best in the prediction 
of job proficiency. The validity of intelligence 
tests, particularly, might be expected to reflect 
the more academic aspects of training criteria. 
As a working hypothesis, however, it would 
not seem unreasonable to consider that the 
validity of arithmetic, spatial relations, and 
mechanical principles tests also would prove 
the most successful in the prediction of job 
proficiency. Indeed, tests of these types com- 
monly are used in vocational counseling for 
appraising fitness for the occupation of auto- 
motive mechanic. The purpose of the present 
investigation was to test the notion that tests 
of arithmetic, spatial relations, and mechanical 
principles would be useful in predicting the 
actual job success of automotive maintenance 
employees. 


Table 1 

Summary of the Validities of Various Types of 
Tests in the Prediction of Success of 
Auto Mechanics Training 

Columns: Type of Test; Weighted Mean Validity Coefficient; Total No. of Cases; No. of Coefficients 

Intelligence ‘ 13 
Substitution 14 1 
Arithmetic : 13 
Tracing 
Pursuit 
Spatial Relations 
Speed of Perception 
Mechanical Principles 
Tapping 
Finger Dexterity 


Methods and Procedure 


Tests of the afore-mentioned three types 
were administered to 225 gasoline bus mainte- 
nance and repairmen employed by a city transit 
company. The men were working in one or 
another of three repair shops of approximately 
equal size, each of which served a different di- 
vision of the company. Within a shop, the 
men were assigned to crews of about twenty 
men, each crew being supervised by a foreman. 
Ratings were utilized as the criterion of job 
proficiency. The rating form consisted of five 
scales covering the following characteristics: 
production, quality of work, dependability, 
cooperativeness, and knowledge. Each fore- 
man rated his men and these ratings were re- 
viewed by the shop superintendent in consulta- 
tion with the foreman. The final rating was 
based upon the consensus of the foreman and 
superintendent. Before the rating procedures 
were instituted, the foremen and superintend- 
ents were given individual instruction by a 
member of the personnel department of the 
company with respect to the particular rating 
system to be used, the general problems of as- 
sessing proficiency and the common errors 
arising from the use of rating procedures. 


Results 


The validity coefficients, as given by the co- 
efficients of correlation between scores on the 
tests and ratings by supervisors, are as follows: 
Arithmetic, r = .19; Spatial Relations, r = .21; 
and Mechanical Principles, r = .30. 

These results indicate that the best predic- 
tions of job proficiency of the mechanics were 
given by the mechanical principles test. The 
validity coefficients of the arithmetic and 
spatial relations tests are fairly low, and under 
most circumstances would be considered to be 
inadequate. The low validity of the spatial 
relations test is surprising in view of the fact 
that it is so widely used as a measure of gen- 
eral mechanical aptitude. In the case of the 
arithmetic test the low validity is more under- 
standable, since automotive repair involves 
relatively few arithmetic computations. Me- 
chanical principles tests have not been widely 
used, especially in employee selection. The 
promising results with this type of test ob- 
tained in these other investigations are sup- 
ported by the present findings. 


Summary 


Previous investigations of the effectiveness 
of tests for the occupation of automotive me- 
chanic have been restricted to the study of 
validity in terms of predicting success in train- 
ing. The findings indicate that the best tests 
for this occupation are intelligence, arithmetic, 
spatial relations, and mechanical principles. 
In order to verify the predictive effectiveness 
of the last three types of test they were ad- 
ministered to 225 bus maintenance and repair- 
men and the scores were compared with ratings 
of performance by supervisors. The test of 
mechanical principles was found to have moder- 
ate validity. The validity of both the arithme- 
tic and spatial relations tests was found to be 
low. 


Selection of Municipal Firemen 


W. M. Wolff 
Stanford University 


and 


A. J. North 
Southern Methodist University 


This study was undertaken as a preliminary 
approach to the general problem of the selec- 
tion of firemen. It was felt that from records 
available, some information could be obtained 
which would be a guide for further efforts to- 
ward developing better fire fighters. This 
particular analysis is primarily of written test 
material with ratings of privates by their 
captains serving as the criterion of validity. 

There is a lack of published studies concern- 
ing selection procedures for firemen. There 
appear to be only four published accounts 
which bear directly on this particular study and 
only one of these reported statistical results. 

The earliest study found was that of Moss 
and Telford reported in October, 1923 (3). 
The report exhibits a trend toward the selection 
of firemen on the basis of mental as well as 
physical capabilities. 

In 1927, the most comprehensive study of 
those reviewed here was published in Germany 
by R. Drill (1). This study covered a period 
of nine months during which 23 subjects were 
tested and ranked on physical, mental, and 
personal qualities. The criterion was an 
overall ranking of the men as firemen by the 
judges of the institute conducting the experi- 
ment. It might be of interest to note that the 
two most significant tests (statistically) were 
of the performance type—one in which the 
subject was required to find a signal in a smoke 
filled room and the other in which he was re- 
quired to mount and descend a scaffolding by 
ladders. The final conclusion in this study was 
that the combined test scores and the profile 
were the most valuable predictive instruments. 

The next study (reported in 1933 by Miner) 
likewise reached the conclusion that a com- 
bined profile was an advantage in selecting 
firemen. Miner claims a high extroversion 
score is desirable and that a borderline intel- 
ligence is a necessary minimum (2). 

The members of the staff of the Public Ad- 
ministration Service present some valuable and 
comprehensive views in their pamphlet en- 
titled The Selection of Fire Fighters (4). This 
study would not be classified as a scientific 
experiment but the findings are based on some 
controlled observations. The report gives the 
greatest weight (60 per cent) to practical writ- 
ten tests of mechanical aptitude, mental abil- 
ity, and ability to understand and follow writ- 
ten and oral directions. Performance tests of 
physical strength and agility are given a 
weight of 25 per cent in the selection criteria, 
and the remaining 15 per cent is the weight 
given to an oral interview designed to bring 
out the applicant’s attitude, self expression and 
general personal suitability. In addition, the 
applicant must pass a medical test. It is sug- 
gested that appropriate credit be given for 
pertinent experience and training beyond the 
required two years of high school. 

Thus in considering these studies one can 
readily visualize the variation in emphasis on 
physical and mental qualities. 


The Present Study 


Procedure: The city concerned in this study 
is a Texas municipality with a population of 
about 500,000 people. The fire department is 
nationally recognized as being modern and 
progressive both in its performance and ideals. 

The subjects for this study are 144 firemen 
selected from a total group of 351 employed 
firemen. The men selected were those who 
had taken the newest form of the Written Test 
for Apprentice Firemen as revised by the city 
Civil Service in August, 1946. All men in this 
study met or surpassed the minimum qualifica- 
tions for firemen which are: age 21 to 31 years; 
height, 5'9” to 6'4” in stocking feet; weight, 
157 to 230 pounds (stripped); satisfactory re- 
port on a rigid physical examination; Army 
Alpha score of at least 124 (or 70th percentile) 
and total Written Examination score of 70 per 
cent or better. The subtests of the total 
Written Examination are weighted as follows: 
General Knowledge (true-false), 26 per cent; 
General Knowledge (multiple choice), 54 per 
cent; Elementary Hydraulics, 6 per cent; 
Knowledge of Inflammables, 4 per cent; Arith- 
metic, 6 per cent; and Knowledge of Hazards, 
4 per cent. 
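
One way the subtest weights could combine into a total per cent score is sketched below. The weights and item counts are those the article reports for each subtest; the scoring rule itself (weight times proportion of items correct) is an assumption, since the article does not spell it out.

```python
# Per cent weights and item counts as reported in the article.
WEIGHTS = {
    "general_knowledge_true_false": 26,
    "general_knowledge_multiple_choice": 54,
    "elementary_hydraulics": 6,
    "knowledge_of_inflammables": 4,
    "arithmetic": 6,
    "knowledge_of_hazards": 4,
}
ITEMS = {
    "general_knowledge_true_false": 69,
    "general_knowledge_multiple_choice": 69,
    "elementary_hydraulics": 6,
    "knowledge_of_inflammables": 10,
    "arithmetic": 10,
    "knowledge_of_hazards": 10,
}


def weighted_total(num_correct):
    """Assumed rule: each subtest adds weight * (proportion of its items correct)."""
    return sum(WEIGHTS[name] * num_correct[name] / ITEMS[name] for name in WEIGHTS)


def meets_written_minimum(num_correct, cutoff=70.0):
    """True when the weighted total reaches the 70 per cent qualifying score."""
    return weighted_total(num_correct) >= cutoff


# Hypothetical applicant.
applicant = {
    "general_knowledge_true_false": 55,
    "general_knowledge_multiple_choice": 50,
    "elementary_hydraulics": 4,
    "knowledge_of_inflammables": 7,
    "arithmetic": 8,
    "knowledge_of_hazards": 6,
}
print(round(weighted_total(applicant), 1), meets_written_minimum(applicant))
```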

In selecting a criterion for evaluating the in- 
formation on the subjects, an efficiency rating 
blank in use by the city was not utilized since 
it had given low correlations with other factors 
in a previous study. A coefficient of +.08 had 
been obtained between the chief’s ratings of 
the men on work efficiency and an examination 
given for promotion to the grade of Fire 
Lieutenant. Also a coefficient of —.13 had 
been obtained as the correlation between the 
efficiency ratings and the Wonderlic Personnel 
Test. The fact that the foregoing correlations 
were so low may be due to the unreliability of 
the subjective ratings, which on a per cent 
basis had a standard deviation of only 2.1. 

In the light of these results it was felt that 
some other criterion might be more valid. 
The criterion chosen was the rankings of the 
privates by the captain, or captains, who knew 
the fireman’s ability best. All captains were 
asked to rank all men under them in order of 
merit with respect to overall ability as fireman. 
Any captain who felt he did not know the 
ability of a “new” fireman (due to changing of 
location and shifts) left the ranking of that 
man up to the captain who did feel he knew the 
person’s ability best. 

In this way each private of the department 
was ranked by the captain knowing his ability 
best. The rankings of the 144 subjects in- 
volved in this study were then selected from 
the rankings of all privates. This resulted in 
17 groups of ranked subjects, varying in num- 
ber from four to fourteen men per group. 
More than half of the men were contained in 
the seven largest groups. 

The data available on the subjects of the 
study were the test scores previously men- 
tioned (in departmental qualifications) and 
personal data including the subject’s age at 
his last birthday, his years of formal education, 
his years of residence in the city limits and his 
months per voluntary job. 

The data may be briefly described as follows. 
The Written Examination for Apprentice 
Firemen is a comprehensive test devised by the 
city Civil Service to measure the applicant’s 
general knowledge and the elementary informa- 
tion which he has which is pertinent to fire 
fighting. The test contains 69 questions in 
each of the general knowledge subtests, the 
hydraulics section has six questions and the 
three remaining parts have ten questions each, 
giving 174 questions for the total Written 
Examination. 

The intelligence test is the Army Alpha re- 
vised by Schrammel and Brannan. The 
Kuder Preference Record is a series of ques- 
tions designed to quantify a person’s interests 
in certain areas. The test of Mechanical Com- 
prehension is that of G. K. Bennett and yields 
a score based on the number of right minus 
half of the wrong answers since all questions 
have three choices in the answer. The Fire 
Department Achievement Test is a series of 
true-false and multiple choice questions aimed 
at testing the retention of fire fighting informa- 
tion given the fireman during his 15 month 
training period. The questions cover ladders, 
tarpaulins, hydraulics, elementary chemistry, 
safety, etc. 
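
The scoring rule given for the mechanical comprehension test (right minus half the wrong, because each question has three choices) is the usual correction for guessing, rights minus wrongs divided by one less than the number of choices. A small sketch, with hypothetical counts:

```python
def corrected_score(num_right, num_wrong, num_choices=3):
    """Rights minus wrongs / (choices - 1); omitted items are not penalized."""
    if num_choices < 2:
        raise ValueError("a question needs at least two choices")
    return num_right - num_wrong / (num_choices - 1)


# Hypothetical answer sheet: 40 right, 10 wrong, the rest omitted.
print(corrected_score(40, 10))
```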

The personal data were taken from the ap- 
plication form; the years of formal education, 
age at last birthday, and years of residence 
within the city limits were merely a matter of 
tabulation. The number of months per volun- 
tary job is a ratio obtained from the number of 
jobs reported and the total months worked. 
Military service, schooling and work on the 
farm of the father of the applicant were ex- 
cluded in an attempt to get a measure of the 
average time spent on a job of the applicant’s 
own choosing. 
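
The months-per-voluntary-job ratio could be computed along the following lines. The record layout and the category labels are hypothetical; only the exclusion of military service, schooling, and work on the father's farm comes from the article.

```python
# Hypothetical category labels for entries that the study excluded.
EXCLUDED = {"military", "schooling", "father_farm"}


def months_per_voluntary_job(history):
    """history: list of (kind, months) entries from the application form."""
    voluntary = [months for kind, months in history if kind not in EXCLUDED]
    if not voluntary:
        return None  # no job of the applicant's own choosing on record
    return sum(voluntary) / len(voluntary)


# Hypothetical applicant: two chosen jobs and a period of military service.
print(months_per_voluntary_job([("job", 24), ("military", 36), ("job", 12)]))
```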

The Written Examination and the Alpha 
are given at the time of application; the Ben- 
nett and Kuder are administered at the time 
of hiring the worker and the Achievement test 
fifteen months after the date of beginning 
work. Thus some men applying at different 
times would not have certain score records and 
not all of them were given the Kuder. 

Technical Section: To facilitate the analysis 
of the data each of the 17 groups was broken 
down into sub-groups of “higher-ranked” men 
and “lower-ranked” men with the median 
rank as the dividing line. Each group con- 
tained the same number of men above as below 
the median of the group. The middle ranked 
subject was eliminated from any group having 
an uneven number of subjects within the 
group. 
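
The median split described above can be sketched as follows, assuming each group is supplied as a list ordered from highest to lowest rank. The names are hypothetical.

```python
def split_at_median(ranked_men):
    """Return (higher, lower) halves; the middle man is dropped if the count is odd."""
    n = len(ranked_men)
    half = n // 2
    higher = ranked_men[:half]
    lower = ranked_men[n - half:]
    return higher, lower


# Hypothetical group of five men, best first; "C" is the median and is dropped.
print(split_at_median(["A", "B", "C", "D", "E"]))
```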

The method of analyzing the results in this 
study was, first, to compare the means of the 
higher-ranked and lower-ranked subjects within 
groups with respect to the several variables; 
and secondly, to measure the degree of rela- 
tionship between the variables showing the 
most significant difference between the means. 

Where data were available on all subjects 
the “t” test for related measures was used in- 
asmuch as paired sub-groups had been ranked 
as a single group. This test seemed feasible 
even though the higher-ranked men of one 
group could have been lower in overall ability 
as firemen than the lower-ranked men of an- 
other group. Where data were not available 
on all subjects it was felt that there were not 
enough men within sub-groups to really test 
sub-group differences. Therefore, these data 
from the various groups were pooled into a 
higher-ranked assembly of subjects taken from 
all higher-ranked sub-groups and a lower- 
ranked assembly taken from all lower-ranked 
sub-groups. These data were then analyzed 
by the “t” test for independent measures. In 
doing this, it was assumed that although the 
sensitivity of the test would be decreased the 
data on the men in different groups were on a 
sufficiently comparable basis to be validly 
pooled. This also did not require discarding 
any data since the test for independent meas- 
ures does not require the same number of cases 
within groups. 
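
The two tests can be contrasted in a short sketch. It assumes that, for the related-measures test, each group contributes one pair of sub-group means, which is one reading of the procedure rather than a confirmed detail, and that the independent test uses the usual pooled variance. All numbers are hypothetical.

```python
import math
from statistics import mean, stdev


def t_related(higher_means, lower_means):
    """Paired t: one (higher, lower) pair of sub-group means per group."""
    if len(higher_means) != len(lower_means):
        raise ValueError("each group must supply both sub-group means")
    diffs = [h - l for h, l in zip(higher_means, lower_means)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1  # statistic and degrees of freedom


def t_independent(higher_scores, lower_scores):
    """Pooled-variance t for two assemblies that may differ in size."""
    n1, n2 = len(higher_scores), len(lower_scores)
    m1, m2 = mean(higher_scores), mean(lower_scores)
    ss1 = sum((x - m1) ** 2 for x in higher_scores)
    ss2 = sum((x - m2) ** 2 for x in lower_scores)
    pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, n1 + n2 - 2


# Hypothetical sub-group means for three groups, then two pooled assemblies.
print(t_related([150.0, 147.0, 149.0], [143.0, 144.0, 141.0]))
print(t_independent([46, 44, 45, 43], [40, 42, 41]))
```

The paired form removes differences between groups from the error term, which is why pooling into independent assemblies is expected to be the less sensitive of the two.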

After these tests for the significance of the 
differences between the means were carried out 
the most significant items were analyzed to 
determine their degree of relationship. Since 
only two variables showed significance at our 
chosen level (two per cent), these were corre- 
lated with each other by the product moment 
method. In addition, the estimated test reli- 
ability of the most significant variable was cal- 
culated. Also, this variable was correlated 
with the criterion. The procedure selected was 
somewhat unusual and, therefore, should be 
clarified here. The groups containing ten or 
more men were selected for analysis since it was 
felt that this was the minimum number of 
cases within groups which was acceptable for 
obtaining a meaningful correlation. These 
groups contained more than half of the subjects 
of the study (80 of the 144). The lowest man 
in each group was given a rank of one, the 
second two, and so on to the highest ranked 
man. These rank values were then correlated 
for each group with the scores of the men on 
the most significant variable. Next, these 
seven coefficients were “averaged” to obtain a 
coefficient for all 80 men. This average co- 
efficient was determined by Fisher’s technique 
which involves a logarithmic transformation 
of the coefficients. The coefficients are also 
weighted in proportion to the differences in the 
sizes of the component groups. It is realized 
that these several groups are not strictly ran- 
dom samples from the total. 
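
The averaging step can be sketched as follows. The transformation is Fisher's z, which equals the inverse hyperbolic tangent of r; the weights of n minus 3 are the conventional choice and are an assumption here, since the article says only that the coefficients were weighted according to group size. The coefficients and group sizes shown are hypothetical.

```python
import math


def fisher_average(coefficients, group_sizes):
    """Weighted mean correlation via Fisher's z, weighting each group by n - 3."""
    if len(coefficients) != len(group_sizes):
        raise ValueError("one group size is needed per coefficient")
    weights = [n - 3 for n in group_sizes]
    if any(w <= 0 for w in weights):
        raise ValueError("each group needs more than three men")
    mean_z = sum(w * math.atanh(r) for w, r in zip(weights, coefficients)) / sum(weights)
    return math.tanh(mean_z)  # transform the mean z back to r


# Hypothetical within-group coefficients for seven groups totalling 80 men.
rs = [0.45, 0.10, 0.35, 0.25, 0.40, 0.15, 0.38]
ns = [14, 12, 12, 11, 11, 10, 10]
print(round(fisher_average(rs, ns), 2))
```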


Results 


The results of the analysis of the data are 
shown in Table 1. 

The total Written Examination for appren- 
tice firemen and the Bennett Test of Mechani- 
cal Comprehension were significant at the two 
per cent level of confidence. No other items 
approach the two per cent level except the 
two items of age and the part of the Written 
Examination concerned with inflammables 
(two to five per cent). The significance of 
age suggests the desirability of further investi- 
gation to isolate those specific concomitants 
of age which may relate to success as a fireman. 

The items concerned with the multiple 
choice general knowledge part of the Written 
Examination and the ratio of the months per 
job approach a level of significance that cannot 
be entirely ignored (five to ten per cent). The 
factor of months per job would seem to indi- 
cate for these 144 men that those who were 
rated in the upper groups had spent more time 
per job (average) than had those in the lower 
groups. 

Table 1 
Comparison of the Means of the Higher-ranking and Lower-ranking Firemen on the Several Variables 

Columns: Variable; Number of Subjects; Means (Higher, Lower); S.E. of Diff.; Probability in Per Cent 

Total Written Examination 
Test of Mechanical Comprehension* 
Age at Last Birthday 
Knowledge of Inflammables 
Months per Voluntary Job 
General Knowledge (multiple choice) 
Elementary Hydraulics 
Years Residence in the City 
Arithmetic 
Kuder Preference Record* 
General Knowledge (true-false) 
Army Alpha 
Achievement Test* 
Knowledge of Hazards 
Years of Formal Education 

148.3 
44.5 
24.6 
7.6 

Less than 1 
1-2 
2-5 
2-5 
5-10 
5-10 
10-20 
10-20 

4.3 
11.2 

* The “t” test for independent measures was used on these variables. The “t” test for related measures was 
used on the remaining variables. See text for explanation. 


The product moment coefficient of correla- 
tion between the total Written Examination 
and the Bennett Test of Mechanical Compre- 
hension for the 79 men having Bennett scores 
was 0.60. This coefficient is significantly dif- 
ferent from zero at the one per cent level of 
confidence. The Bennett test was not cor- 
related with the criterion since the 79 men 
concerned were scattered over 17 groups and 
it was not felt there were enough men per group 
to calculate reliable within group coefficients. 
Neither could the data be pooled to investigate 
the degree of relationship since there were 
different absolute standards of ranking. 
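
The article does not state which significance test was applied to the coefficient of 0.60. A conventional choice, shown here as an assumption, is the t statistic for a product moment correlation with n minus 2 degrees of freedom.

```python
import math


def t_for_correlation(r, n):
    """t statistic for testing a product moment r against zero (df = n - 2)."""
    if n < 3 or abs(r) >= 1:
        raise ValueError("need n >= 3 and |r| < 1")
    return r * math.sqrt((n - 2) / (1 - r ** 2))


# Values reported in the article: r = 0.60 for the 79 men with Bennett scores.
print(round(t_for_correlation(0.60, 79), 2))
```

A statistic of this size on 77 degrees of freedom is far beyond the two-tailed one per cent critical value of roughly 2.6, which agrees with the significance reported.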

The total Written Examination was corre- 
lated with the criterion, as explained in the 
technical section, and gave an “average” co- 
efficient of 0.30, which is significantly different 
from zero at the one per cent level of confi- 
dence. Although this is not “high,” it is a 
meaningful measure. It would hardly be ac- 
ceptable to select firemen on the basis of this 
test alone, but it would be helpful in a battery 
of tests for selection purposes. 

Lastly, the split-half reliability of the total 
Written Examination was calculated by cor- 
relating the number missed on odd-numbered 
questions with the number missed on even- 
numbered questions. The results for the 144 
men gave an estimated test reliability of 0.79. 
This estimated test reliability with its standard 
error of 0.04 furnishes a preliminary evaluation 
of the test reliability (see 6). 
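
An odd-even correlation describes a test of half length, so it is usually stepped up with the Spearman-Brown formula to estimate full-test reliability. Whether the reported 0.79 is the stepped-up value is not stated in the article; the sketch below assumes it is and works backward to the half-test correlation that would produce it.

```python
def spearman_brown(r_half):
    """Full-length reliability estimated from the correlation between two halves."""
    return 2 * r_half / (1 + r_half)


def half_from_full(r_full):
    """Inverse of the step-up: the half-test correlation implied by r_full."""
    return r_full / (2 - r_full)


# If 0.79 is the stepped-up estimate, this is the implied odd-even correlation.
implied_half = half_from_full(0.79)
print(round(implied_half, 3), round(spearman_brown(implied_half), 2))
```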


Summary 


From a total of 351 privates in the Fire De- 
partment of a Texas city, a group of 144 was 
selected who had complete records for a revised 
test battery administered by the city Civil 
Service Commission. The criterion for the 
analysis was the rankings of the subjects on 
their ability as firemen, obtained from the 
captains who knew the individual’s ability best. 

The results were as follows: 

1. The total Written Examination for ap- 
prentice firemen and the Bennett Test of Me- 
chanical Comprehension showed a significant 
difference between the means of the higher- 
ranked subjects and the lower-ranked subjects 
at the two per cent level of confidence. 

2. The applicant’s age and his score on the 
division of the Written Examination dealing 
with his knowledge of inflammables showed a 
significant difference between the means at the 
five per cent level of confidence. 

3. The total Written Examination yielded a 
correlation coefficient of 0.30 with the private’s 
rankings and correlated positively with the 
test of mechanical comprehension as shown by 
a coefficient of 0.60. These coefficients are 
significantly different from zero at the one per 
cent level of confidence. 

4. The total Written Examination gave an 
estimated test reliability coefficient of 0.79 
with a standard error of 0.04. 

5. The Kuder scores, the months per volun- 
tary job and the years of residence within the 
city limits are suggested as possibilities for 
further investigation although they did not 
exhibit a high level of statistical confidence in 
the present study. 


Received March 29, 1950. 


Difficulty and Validity of Analogies Items in 
Relation to Major Field of Study * 


Jerome E. Doppelt 
The Psychological Corporation 


When a test is made up of verbal items drawn 
from various subject matter areas it is reason- 
able to expect relevant previous training of the 
students to be a factor affecting test perform- 
ance. Obviously, the student who has had 
several years of study of physics, for example, 
will find physics items easier than will students 
without such previous training. If the people 
tested are college seniors or graduate students, 
then it seems even more likely that items from 
a particular content area will be found to be 
considerably easier by students who have 
majored in that area than by those who have 
majored in other fields. Not only the diffi- 
culty of items but their validity may be 
markedly affected by the students’ previous 
training. When a single test score is based on
questions in several areas of knowledge, the 
problem of bias in favor of certain groups must 
be considered. Such a score, if interpreted as 
a measure of intelligence, may be quite mis- 
leading for it is conceivable that the score is 
more a reflection of specialized training than an 
indication of general ability. 

These facts are generally known to test 
makers. But verbal items must be based on 
content appropriate to the population which 
is to be measured. If the people to be tested 
are at the college senior level, then there will 
be students who have had intensive training in 
certain areas. In seeking a measure of intel- 
ligence for such people one possible approach 
is to balance the test with items from many 
different areas so that no group of students is 
unduly favored because of special training. 
Furthermore, one might use only items which 
are more or less “general cultural” items in the
different areas and avoid the use of questions 
which require highly specialized training. An- 
other alternative is to use a type of item which 
requires reasoning ability rather than memory 
or recognition of specific facts. 


* Presented as a paper at the American Psychological 
Association Convention, September 5, 1950. 

The Miller Analogies Test makes use of
these various approaches in measuring the in- 
telligence of college seniors and graduate stu- 
dents. The items are in the form A:B=C:D 
with one of the four parts omitted. The stu- 
dent is required to select that option out of four 
which establishes a reasonable analogy. Since 
the test is aimed at a high level of ability, it 
seemed advisable to select the items from the 
areas of science, mathematics, literature, his- 
tory, psychology, and the like. The stress, 
however, is on the relationships between the 
parts of the analogy rather than on specific 
knowledge of a field. Nevertheless, it was con- 
sidered desirable to determine whether a clas- 
sification of items according to content dis- 
closed any relationships between the major 
fields of the students and the difficulty and 
validity of the items. 

The subjects of the study were 5,311 college 
seniors and graduate students who had taken 
Form G of the Miller Analogies Test. These 
people were studied in two groups: Group A, 
which included the first 3,856 cases tested, and 
Group B, which was composed of 1,455 people 
tested after the data for Group A had been 
gathered. 

Three major field categories were used to 
classify the students: science majors, including 
majors in the physical and biological sciences, 
mathematics, engineering and the like; non- 
science majors, including majors in the social 
studies, languages, arts, etc.; and the rela- 
tively homogeneous category of psychology 
majors. The psychology students were stud- 
ied separately because a large number of cases 
was available, due to widespread use of the 
test by departments of psychology, and partly 
out of curiosity, since earlier norms on the 
Miller Analogies Test had shown psychology 
majors were a superior group. 

The classification of the items of the Miller 
Analogies Test by subject matter area was a 
difficult job. In some instances, half of an 
analogy item would be drawn from a scientific
field and the remaining half from a non-science 
area. Should such items be called science or 
non-science? The rule finally decided upon 
was to classify an item in the science category 
if the judge felt that knowledge of science was 
most helpful in determining the correct an- 
swer; to classify the item as non-science or psy- 
chology if knowledge of one or the other area 
was considered most useful in answering the 
item correctly. Three psychologists worked 
independently on the classification of items. 
The judges agreed on the classification of all 
but five items and these were resolved in con- 
ference of the three judges. The number of 
items finally classified as science was 29, and 
65 items were considered non-science. The 
psychology category included only six items 
and was therefore not considered in the study. 

The science and non-science items were 
analyzed for the science, non-science and 
psychology major field groups. For every 
item the difficulty and validity indices were 
computed for the three curricular groups and 
for the total number tested. The validity or 
discrimination index was a point-biserial cor- 
relation coefficient between the item and total 
test score. Computations were made separ- 
ately for Group A and for Group B. The re- 
sults are shown in Table 1. 
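
The two item indices described above can be illustrated with a short computation. The following is a minimal sketch in Python, not the procedure actually used in the study; the function names and the eight-examinee data set are hypothetical. It takes p as the proportion of examinees passing an item and computes the point-biserial coefficient between the item (scored 1 or 0) and the total test score.

```python
import math


def item_difficulty(item_scores):
    """Proportion of examinees who answered the item correctly (p)."""
    return sum(item_scores) / len(item_scores)


def point_biserial(item_scores, total_scores):
    """Point-biserial r between a 0/1 item and the total test score."""
    passed = [t for x, t in zip(item_scores, total_scores) if x == 1]
    failed = [t for x, t in zip(item_scores, total_scores) if x == 0]
    if not passed or not failed:
        raise ValueError("item must be passed by some examinees and failed by others")
    n = len(total_scores)
    mean_total = sum(total_scores) / n
    # Population standard deviation of the total scores (divisor n).
    sd_total = math.sqrt(sum((t - mean_total) ** 2 for t in total_scores) / n)
    if sd_total == 0:
        raise ValueError("total scores must vary")
    p = len(passed) / n
    mean_pass = sum(passed) / len(passed)
    mean_fail = sum(failed) / len(failed)
    return (mean_pass - mean_fail) / sd_total * math.sqrt(p * (1 - p))


# Hypothetical data: one item and the total scores of eight examinees.
item = [1, 1, 1, 0, 0, 1, 0, 1]
totals = [70, 65, 58, 40, 52, 61, 45, 66]

p = item_difficulty(item)
r_pb = point_biserial(item, totals)
print(f"p = {p:.3f}, point-biserial r = {r_pb:.3f}")
```

Computed this way, the point-biserial coefficient is numerically identical to the ordinary product-moment correlation between the 0/1 item score and the total score, which is why it can serve as a discrimination index.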


It may be noted that the psychology majors
find both the science items and the non-science
items easier than do the science and non-science
majors. This is consistent with the norm 
tables for the Miller Analogies Test, Form G, in 
which psychology majors are the highest 
scoring group. There may be several reasons 
for this finding but two possibilities will be 
mentioned. Higher standards have been es- 
tablished by graduate departments of psy- 
chology and perhaps the better students are 
now applying for admission to this field. Also, 
it may be noted, psychology majors are gener- 
ally more testwise than other students, and 
perhaps this is of some help in obtaining higher 
scores. Psychologists will probably feel that 
the first of these two possibilities is much more 
acceptable. 

With respect to average item difficulties, the 
relationship between the types of items and 
the major fields of the students is not quite 
what one might have anticipated. It would be 
reasonable to expect science people to excel 
other majors on science items, and non-science 
people to excel on the non-science items. 
However, science majors find the science items 
considerably easier than do non-science majors 
but science majors also do better on the non- 
science items. It appears that the order of 
excellence, based on difficulty of items, is psy- 

Table 1

Mean Difficulty and Validity Indices of Science and Non-Science Analogies
Items for Three Major Field Groups

                          Group A                            Group B
Students'              p**           r                   p**           r
Major Field      N    Mean  S.D.   Mean  S.D.      N    Mean  S.D.   Mean  S.D.

29 Science Items
Science        1025   ...   ...    ...   .05      190   ...   ...    .32   ...
Psychology     1012   ...   .19    ...   .09      246   ...   ...    ...   ...
Non-Science    1585   .48   .18    ...   .08      951   ...   ...    .32   ...
Total*         3856   ...   .17    ...   .07     1455   ...   ...    ...   ...

65 Non-Science Items
Science        1025   ...   .19    ...   .08      190   ...   ...    .35   ...
Psychology     1012   .65   .19    ...   .06      246   .66   ...    .31   .10
Non-Science    1585   .53   .17    ...   .08      951   .49   ...    .36   .08
Total*         3856   .58   .18    ...   .10     1455   .53   ...    .37   .10

* The “Total” includes the three major field groupings, plus students who did not indicate their major fields.
** p: proportion of students who answered item correctly.

Table 2

Means and Standard Deviations on MAT of Curricular Groups

                 Group A          Group B
Major Field       Mean           N     Mean

Science           59.0          190    56.1
Psychology        64.7          246    66.7
Non-Science       51.5          951    47.8
Total             57.2          ...    52.3

chology majors, science majors, and non-sci- 
ence majors, regardless of the classification of 
the item. This may be partly due to the type 
of item used in this test but more will be said 
about this idea later. 

Differences between average validity indices 
for science and non-science majors are more 
marked in Group A than in Group B. For the 
science items, the non-science majors in Group 
A have an average validity index which is .04 
higher than that for the science majors in the 
same group. But in Group B there is no 
difference between the averages for the two 
curricular groups. For the non-science items,
the average index for the non-science majors
in Group A is .05 higher than the average for
the science majors. However, the difference
shrinks to .01 in Group B. Since the validity 
indices are coefficients of correlation between 
the items and total test scores, they would be 
affected by the range of total scores on the 
test. The standard deviation of such scores
for the non-science majors of Group A was 
16.9 (see Table 2); for the science majors of 
Group A it was 14.3. In Group B the respec- 
tive standard deviations were 16.0 and 15.4. 
The differences in average validity indices of 
science and non-science majors in Group A 
may be attributed in part to the difference in 
standard deviations of total test scores, a dif- 
ference which is more marked in Group A than 
in Group B. The psychology majors, who
are a more homogeneous group than the
broad categories of science and non-science
majors, have a total-test standard deviation of
approximately 13. This is considerably lower
than that for the non-science majors and slightly
lower than that for the science majors of either
group. The item validity indices for the psychology
majors seem to follow the pattern of
the standard deviations: they are generally
lower than those of the non-science majors and 
only slightly below those of the science majors. 
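
The dependence of an item-total correlation on the spread of total scores can be made concrete with the standard formula for direct restriction of range. The sketch below is an illustration only and was not part of the original analysis. It assumes a linear, homoscedastic relation between item and total score and treats the narrower group as a range-restricted sample of the wider one; the function name and the starting coefficient of .35 are hypothetical, while the two standard deviations are the Group A values quoted in the text.

```python
import math


def restricted_r(r_full, sd_restricted, sd_full):
    """Correlation expected after the SD of one variable shrinks.

    Standard formula for direct restriction of range on one of the two
    correlated variables (linearity and homoscedasticity assumed).
    """
    k = sd_restricted / sd_full
    return r_full * k / math.sqrt(1.0 - r_full ** 2 + (r_full ** 2) * k ** 2)


# Hypothetical item-total correlation in the wider group (not a Table 1 value).
r_wide = 0.35
# Total-score standard deviations reported for Group A:
# 16.9 for non-science majors, 14.3 for science majors.
r_narrow = restricted_r(r_wide, sd_restricted=14.3, sd_full=16.9)
print(f"wider group r = {r_wide:.2f}, narrower group r = {r_narrow:.2f}")
```

Under these assumptions a difference in standard deviation of the size observed in Group A would by itself lower a coefficient of this magnitude by a few hundredths, which is the order of the differences discussed above.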

This discussion of the standard deviations of 
the total test scores is not intended as a full 
explanation of such differences as appear in 
Table 1. It is offered as a factor to be considered
in evaluating the size of these differences.
From a practical viewpoint, the data
for Groups A and B would lead to the conclu- 
sion that the discriminating power of the items 
seems to be relatively independent of the 
major fields of the students. The science items 
tend to have somewhat lower average validity 
coefficients for all three curricular groups than 
do the non-science items. 

Items have been referred to as science items 
and non-science items although it will be rec- 
ognized there are many limitations to this type 
of classification for the analogies items con- 
sidered here. Many of the items are of the 
general cultural type; that is, the subject
matter is such that most educated people
would have been exposed to it regardless of 
their major field of specialization. An item 
which calls for the relationship between simple 
decimals and fractions may technically be 
classified as mathematical or scientific ma- 
terial but in reality it is elementary arithmetic. 
Similarly, the association of Pasteur with 
antisepsis is common cultural knowledge 
rather than specialized scientific information. 
In analogies items the primary purpose is to
determine whether the student can identify the 
relationships between the parts of the analogy. 
It is possible to construct analogies which re- 
quire highly specialized training before the 
reader can seek relationships. For such items 
considerable correlation may exist between 
item difficulty and validity and major field of 
the subjects. However, the Miller Analogies
Test has generally avoided items based on 
esoteric material. The data indicate that 
when such material is not used as the basis for 
the analogies, items may have discrimination 
indices which are relatively unaffected by the 
students’ major fields. 

Differences in difficulty values may be ex- 
pected even when esoteric subject matter is 
avoided. But it does not necessarily follow 
that people with specialization in allied fields 
will excel on the items taken from their fields. 
Although there were 65 items in this study 
which seemed to be free of scientific context, 
science majors found these items easier, on the 
average, than did non-science majors. 


Emphasis on the relationships between 
parts of an analogy can result in a set of items 
of this type which are relatively free of the 
effect of intensive specialized training. Such 
items, which apparently emphasize the per- 
ception of relationships, can be particularly 
useful in the measurement of intelligence of 
high-level groups who have been exposed to 
different types of training. Adequate dis- 
crimination among such people may be ob- 
tained by well-constructed analogies items 
without using very specialized or esoteric 
material. 


Received December 18, 1950. 
Early publication. 














Study of Executive Leadership in Business. IV. Sociometric Pattern 


C. G. Browne 
Wayne University 


This is the last in a series of articles dealing 
with the following methods of studying execu- 
tive leadership in business: R, A, and D Scales, 
social group patterns, Goal and Achievement 
Index (1, 2, and 3), and sociometric pattern.
The subjects for the study were 24 executives 
in a tire and rubber manufacturing company. 
They included all the executives on the 1st, 
2nd, 3rd, and 4th echelons of the business, with 
the exception of one executive on the third 
echelon. The executives were classified into 
departmental activities as follows: General 
Administration, 4 cases; Sales, 6; Finance, 4; 
Manufacturing, 8; and Personnel, 2. 


Sociometric Procedure 


Moreno (5, p. 11) defined sociometry as 
“that part of socionomy which deals with the 
mathematical study of psychological proper- 
ties of populations, the experimental technique 
of and the results obtained by application of 
quantitative methods.” In his and other 
studies which have been made with the soci- 
ometric approach, individuals have been 
asked to choose from any given group, those 
individuals within the group with whom they 
would most or least like to live, or sit next to,
or to have as their leader, and so forth. In 
this way the attitudes and general regard which 
members of a group have for other members 
have been determined. These results usually 
have been presented in the form of a sociometric
diagram. 

The “choices” which were made in the pres- 
ent study were not entirely free choices, since 
the executives reported conditions at the time, 
some of which may not have been the result of 
executive preference. Therefore, they did not 
report with whom they would like to spend 
their time, but rather with whom they did 
spend their time. However, it is believed that 
the element of preference or choice still is in- 
volved in this situation due to the operation of 
such factors as selection and placement of 
executives, the freedom of the individual to
deviate from strict lines of authority in the
industrial organization, and the differing 
operating relationships between the informal 
and formal organization charts of the business. 
Nevertheless, since this may not be Moreno 
sociometry in the strict sense, the choices 
made by these executives will be referred to as 
“choices.”


Sociometric Pattern


As part of a 2½ to 3½ hour interview, each
executive named the individuals with whom he 
spent most time in getting his work done, broken 
down into three groups: (1) people in his own 
department; (2) people outside his depart- 
ment, but within the company; and (3) people 
outside the company. Figure 1 presents, in a 
sociometric diagram, the first and second 
“choices” of the executives on the basis of 
maximum time spent with men either within 
or outside the executive’s own department, but 
including only the executives in the study.¹
This group was chosen for the diagrammatic 
presentation here because of its obvious ad- 
vantage in this study concerned with the inter- 
relations of this particular group of executives. 
An arrow coming into the executive’s box on 
Figure 1 indicates that he was named as one of 
the two men with whom most time was spent 
in getting work done by the executive with 
whose box the arrow connects, a first “choice” 
meaning most time spent and a second “choice” 
meaning next most time spent. 

The basis for Figure 1 is an organization 
chart of the business. It will be noted, how- 
ever, that the form of the organization chart 
is not the traditional form in which the chart 
is interpreted from the top to the bottom. 
The chart used here is a concentric organization 
chart (4) in which the organization is dia- 
grammed and analyzed as revolving around a 
focal point, in this case the President and 
General Manager. 

¹ The “choices” which the Technical Manager received
are included, since he is included on the organization
chart.

An analysis of Figure 1 makes it clear that
the working relationships in this group of ex- 
ecutives centered around three men: the Vice- 
President-Manufacturing, who received four 
first and four second “choices”; the Vice- 
President-Sales, who also received four first 
and four second “choices”; and the Treasurer 
with four first and two second “choices.” 
With one exception, all of the VPM contacts 
were with executives who were under his supervision,
the exception being that the VPS selected
the VPM as his second “choice.” The
VPS, however, received first “choices” from
the President and General Manager and the
Treasurer, and a second “choice” from the
Chief Cost Accountant, none of whom were 
under the supervision of the VPS. None of 
the executives included in the study were under 
the supervision of the Treasurer, so that all of 
his “choices” had to come from executives out- 
side his own department. It might be noted 
that only one of those who “chose” the Treas- 
urer was outer to the second echelon of execu- 
tives, this being the Manager Congo Stores 
who jumped his immediate superiors, the Sales 
Manager and the Vice-President-Sales, in 
“choosing” the Treasurer. 

The President and General Manager received
only one first and one second “choice”
from the Director of Public Relations and the 
Secretary of the Company, respectively. This 
could be expected since, in studying the work 
performed by these men, it was clear that they 
acted as assistants to the President and Gen- 
eral Manager and performed the details of 
some of his major functions. 

This presentation of personal relations 
within the executive group on a concentric 
organization chart offers the opportunity to 
survey these relations in a rapid and critical 
manner and to gain some insight into the 
working patterns of the executives. While the 
establishment of criteria for the evaluation of 
this presentation was not included within the 
study, certain observations can be made. 

The President and General Manager with 
few “choices” on the sociometric pattern ap- 
parently was delegating a large amount of 
activity to executives on outer echelons. This 
was supported here by the large number of 
“choices” received by the three heads of the 
major departments on the second echelon— 
Sales, Manufacturing, and Finance. It was 
also supported in the R, A, and D Scales (1) in 
which the delegation of authority score for the 
President and General Manager indicated that 
he was delegating to a greater degree than any 
other executive. 

Some insight into the relationships between 
the formal organization of the business as indi- 
cated on the organization chart and the in- 
formal organization as indicated by the soci- 
ometric “choices” may also be gained from 
Figure 1. In the manufacturing division, all 
of the executives except the Manager Shipping 
gave either their first or second “choices” to 
their immediate superior. This was also true 
in Finance. However, in the Sales Division, 
two of the four men in the fourth echelon—the 
Manager Congo Stores and the Manager Sales 
Orders—gave a first and second “choice,” 
respectively, to the Vice-President-Sales, but 
gave no “choice” to the Sales Manager who 
was the executive to whom they were immedi- 
ately responsible. The Manager Congo Stores 
gave his second “choice” to the Treasurer, and
the Manager Sales Orders gave his first “choice”
to the Manager Production Control. This 
situation may have been the result of necessity 
in that the work of the individual related to
the functioning of executives other than his
immediate senior. On the other hand, it may 
be that a close analytic study of the relations 
within the group would have revealed the need 
for some corrective planning. 


Sociometric Choice Correlations 


The use of the R, A, and D Scales was first 
reported by Stogdill and Shartle in their studies 
of Naval Leadership (6) and its use with busi- 
ness executives was reported in the present 
series of articles (1). Correlations between 
number of sociometric “choices” and the ex- 
ecutive’s score on each of the three factors in 
the R, A, and D Scales were as follows: R 
(responsibility), .29; A (authority), .27; D 
(delegation of authority), .48. These correla- 
tions can be interpreted only as descriptive 
statistics for this population of executives, but 
they indicate a trend for the executives with 
the greater responsibility and authority, and 
particularly for those executives who believe 
they delegate authority to a large degree to be 
“chosen” as those with whom other men spent 
their time in getting their work done. 
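
The tabulation behind such coefficients is simple enough to sketch. The example below is a hypothetical illustration in Python and does not use the study's data: the five executives, their “choices,” and their delegation scores are invented, and the function names are assumptions. It counts the first and second “choices” each executive received and computes the product-moment correlation between that count and a scale score.

```python
from collections import Counter
import math


def choices_received(choices):
    """Count first and second "choices" received by each person.

    choices maps each chooser to a (first, second) pair of names.
    """
    received = Counter()
    for first, second in choices.values():
        received[first] += 1
        received[second] += 1
    return received


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / math.sqrt(ss_x * ss_y)


# Hypothetical data: five executives, each naming a first and second "choice".
choices = {
    "A": ("B", "C"),
    "B": ("A", "C"),
    "C": ("B", "A"),
    "D": ("B", "C"),
    "E": ("B", "D"),
}
# Hypothetical delegation-of-authority scores for the same executives.
delegation = {"A": 4, "B": 5, "C": 2, "D": 3, "E": 1}

received = choices_received(choices)
names = sorted(choices)
counts = [received[name] for name in names]  # Counter gives 0 for the unchosen
scores = [delegation[name] for name in names]
r = pearson_r(counts, scores)
print(f"choices received: {dict(zip(names, counts))}, r = {r:.2f}")
```

With only a handful of cases, as with the 24 executives studied here, a coefficient obtained this way is best read as a description of the group at hand and not as an estimate for executives in general.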

A correlation of —.17 was obtained between 
sociometric “choice” and the per cent of time 
the executive spent in supervision. That is, 
the executives who spent more of their time 
supervising the activities of others were some- 
what less likely to be “chosen” in the sociomet- 
ric pattern. This substantiates the relatively 
high positive correlation between sociometric 
“choice” and delegation of authority, since this 
positive correlation indicates that those men 
who gave their subordinates more of a free 
hand and did not exercise as much personal 
supervision tended to be “chosen” with greater 
frequency on the sociometric pattern. As 
might be expected, a positive correlation of .34 
was obtained between sociometric “choice” 
and the per cent of time spent with persons. 

Although these correlations were small, they 
were indicative of the relationships between 
the methods which the executive used in carry- 
ing out his functions and his relationships with 
other executives. It is reasonable to hypothe- 
size that a more detailed analytic study of the 
methods used by executives and the ways in 
which they spend their time would yield higher 
correlations. 


Summary


The inner 24 executives of a tire and rubber 
manufacturing company were asked to name 
the men with whom they spent most time in 
getting their work done. The data for their
1st and 2nd time-spent “choices” were presented
in a sociometric pattern, the basis of
which was a concentric organization chart. 
From these data, it appears that the socio- 
metric method offers the opportunity to study 
certain aspects of the following variables of 
executive leadership in business: 


1. Interpersonal relationships within the
organization.

2. Communication channels among personnel.

3. Differences in the flow of activity as reported
on the formal and informal organization
charts.

4. Study of methods employed in the performance
of leadership functions.

5. Insight into desirable or needed corrections
and modifications in personnel
relationships.

6. Comparison with other variables in leadership
activities to aid in the eventual
selection, training, and replacement of
leaders and the evaluation of executive
and leadership performance.


A Human Relations Training Program *


Ralph R. Canter, Jr. 
University of California, Berkeley 


In recent business and industrial literature 
considerable emphasis has been given to the 
conceptual area called “human relations.” 
Numerous writers have become deeply con- 
cerned with the implications of business and 
industry social practices and the interpersonal
and group relations of their members (see re- 
view in 1). The term “human relations” has 
been applied to these practices and relations. 

Besides references in the literature to human 
relations, the importance of the area can be il- 
lustrated by the number of training programs 
in operation in many concerns. Many of these 
programs are concerned with human relations 
problems. Since World War II a basis for 
many programs has been the Job Relations 
Training course developed by the Training 
Within Industry organization (15). How- 
ever, criticism has been made of courses similar 
to the JRT course because of the “package”
presentation and the relative sterility when it 
becomes a matter of application (4, 5). 

It is the writer’s opinion that while these 
criticisms are quite valid, there are two more 
fundamental problems which must be first 
solved in human relations training. These 
problems involve needs for systematic knowledge
of what to teach and how to measure outcomes.
These limiting obstacles to the development
of effective human relations training
appear to be of primary importance and
worthy of research effort.

* This article is in part drawn from the writer’s dissertation
submitted to the Graduate School of the Ohio
State University in partial fulfillment of the requirements
for the Ph.D. degree. Grateful appreciation is
expressed especially to Prof. C. L. Shartle as the writer’s
adviser, and to Profs. D. T. Campbell, D. D. Wickens,
F. M. Fletcher, A. E. Coons, J. B. Rotter, and Drs.
Melvin Seeman and John Hemphill for their advice and
assistance. Grateful acknowledgment is also expressed
to the Farm Bureau Insurance Companies of Columbus,
Ohio, which established a Research Fellowship and
stipend for the writer. Special thanks are given to
Mr. J. G. Charles, Director of Training, Mr. H. E.
Evans, Director of Personnel, Mr. Murray D. Lincoln,
President, and the departmental personnel cooperating
in the studies.

The objectives of this exploratory investigation
represented an attempt to study these two
problems. The objectives were: (A) to de- 
velop and present to a group of supervisors a 
course of systematic generalizations and prin- 
ciples covering a portion of the area of human 
relations; and (B) to attempt to evaluate the 
course by measuring certain dimensions of 
supervisory behavior which were thought to 
be amenable to change through the influence 
of the course. 


Development of Course 


The human relations training course was de- 
veloped in the home offices of three large in- 
surance companies as a part of the existing 
personnel training program. Since there was 
little published material adequately organized 
for immediate use as a text or readings, either 
for the instructor or the trainees, the writer 
developed a syllabus, the content of which was 
drawn from several social science areas. The 
major portion of the material came from the 
field of psychology because of the writer’s 
specialization and because of the belief that 
psychology offered the greatest systematization 
of facts and principles applicable to supervisory 
human relations training. 

A guiding principle was that the course 
should offer facts, principles, and generaliza- 
tions which would serve as a background for 
additional practice and “technique” training. 
Human relations training appears too complex 
to hazard an assumption that more adequate 
supervisory skills, attitudes, adjustments, and 
so on, could be developed without some basic 
understanding of the concepts, ideas, or even 
bare words involved in such development. 
The creation of means for description and ex- 
pression (communication) may be almost as 
important for supervisory development (10, 
pp. 582-583) as the creation of permissiveness, 
democratic attitudes, sensitiveness to attitud- 
inal “frames of reference,” cognition of em- 
ployee status symbols, etc. 
Helpful in drawing up the syllabus was the
description of a human relations training pro- 
gram developed by Maier (7). He developed 
an extensive course which was directed toward 
attitude change and the development of demo- 
cratic leadership. 

Three course objectives were evolved: (1) 
to establish facts and principles concerning 
psychological aspects of behavior and group 
functioning to enable supervisors to become 
more competent in their knowledge and under- 
standing of human behavior; (2) to increase 
the capacity of supervisors to observe human 
behavior; (3) to present personality adjust- 
ment concepts aiding in the integration of any 
achievements made in the first two objectives. 

The topics were established in line with these 
objectives. Two major revisions in content 
were made, one after more data on training 
needs were obtained, and the other after a 
preliminary series with two groups of execu- 
tives. This series was composed of 11 sessions, 
two hours each. The final revision contained 
10 two-hour sessions. A lecture-discussion
method was employed. 

The final topics were as follows: 1. The 
Supervisor and His Study of Human Nature; 
2. Personality; 3. Adjustment in Personality; 
4. Peculiar Personalities and Unusual Be- 
havior; 5. Motivation; 6. Attitudes, Opinions, 
and Morale; 7. Individual Differences; 8. 
Leadership and Communication; 9. Psycho- 
logical Aspects of Absenteeism and Turnover; 
10. Group Structure and Review. 


The Experimental Study 


To determine the influence of the training 
sessions upon certain aspects of supervisory 
behavior, selected tests were administered before
and after training to two groups of supervisors:
an experimental group (E) which received
training, and a control group (C) which
did not receive training.¹ The E group was
composed of all supervisors from one department,
and the C group contained all supervisors
from two departments, making two
groups of 18 each. All three departments performed
the same tasks and the supervisors had
essentially the same functions and problems.

¹ Another control group was used for the purpose of
analysis in accordance with an extension of the traditional
control group design presented by Solomon (12).
A discussion of the obtained results will be presented in
a separate article.

The E and C groups were quite similar in 
respect to means on the following variables: 
Age, about 30; years of education, about 13; 
and sex, 7 males and 11 females, 8 males and 
10 females respectively. Slight differences ex- 
isted between years of service with the com- 
panies (4.6 and 7.5, respectively), and between 
mean scores on a mental alertness test. But 
on these two variables the standard deviations 
were large and thus the differences were not 
statistically significant. Further, the statisti- 
cal technique used did not require prematched 
individuals or groups. Thus these groups were 
considered quite satisfactory for the objectives 
of this study. There was no time available 
for further replication with other departments. 

It should be noted that the department head 
and the two assistant department heads of the 
E and the two C departments, respectively, 
were participants in the preliminary course 
held for executives. This added an unmeas- 
urable variable. But it could not be done 
safely otherwise since the department heads 
had to be informed about what their supervis- 
ors and employees were being requested to do. 
It can be hypothesized, however, that any 
changes in the trained group over the untrained 
group may be considered as underestimations 
rather than overestimations. That is, the C 
departments’ executives might pass along at- 
titudes and information which would enable 
the C group supervisors to perform at a higher 
level than expected, and the course might have 
been better than the obtained results indicated. 

The Test Battery. Three earlier studies in 
the area of human relations research and train- 
ing suggested some tests useful for measuring 
the effects of training. These studies were by 
Katzell (6), Sanford and Hemphill (11), and 
Meyer (8). 

Six tests were selected, yielding a total of 
twelve separate scores. It was thought that 
the tests would reveal the effects of the training 
upon the behavior of the trained supervisors in 
accordance with the three primary objectives 
of the course. The tests used are described 
below. 


1. General Facts and Principles. A 25-item 
test of general psychological knowledge, including 
personality concepts. It was developed by the writer. 
Related to objectives 1 and 3. 

2. How Supervise? Form M (2). The 
total and three sub-scores were used. Related 
to objectives 1 and 2. 

3. General Logical Reasoning (Watson- 
Glaser, 16). Test 5 from Battery IIA was 
used. Related to objective 2. 

4. Social Judgment Test for Supervisors 
(Meyer, 8). A 37-item projective-type test 
which required ranking of 4 alternatives under 
each item. It was obtained from Dr. H. H. 
Meyer of the Detroit Edison Company and 
reproduced by his permission for experimental 
purposes. Related to objectives 2 and 3. 

5. Supervisory Questionnaire. A 10-item 
free-response questionnaire, developed by the 
writer, answered in writing by the supervisors. 
It was designed to determine their attitudes, 
approaches, techniques, and general orienta- 
tion toward understanding of human behavior 
and human relations problems. The scoring is 
described elsewhere (1). Related to objectives 
2 and 3. 

6. Test of Ability to Estimate Group Opinion 
(TAEGO). A 60-item test used in conjunction 
with a morale questionnaire study. A morale 
questionnaire was distributed to the employees 
working under the supervisors in groups E and 
C. The test requested estimates by the super- 
visors of the percentage of the employees (by 
section and by department) holding certain 
opinions as indicated in the morale survey. 
Four separate scores were obtained: (a) Section 
Error; (b) Department Error; (c) Section Bias 
(under- or overestimation of either high or low 
morale items); and (d) Department Bias 
(same as c). This test was developed by Mr. 
James Sprunger (13) and Dr. D. T. Campbell. 
The morale study was conducted by Miss 
Bonnie Tyler (14). Related to objective 2. 


An additional criterion measure of the E 
group was proposed for analysis purposes. 
This was obtained by having the department 
head and assistant department heads jointly 
rank the E supervisors. This was not accom- 
plished until about the middle of the training 
program. It was not possible to obtain similar 
ranks for the C group. Ranks were used be- 
cause of their simplicity and the agreement of 
the ranking method with complex methods 
when there is a small number of cases. 

Statistical Methods. Two primary statistical 
analyses used will be described. Another in- 
volved the use of Solomon’s control group de- 
sign referred to previously. The two de- 
scribed here are: (A) level of significance of 
differences between predicted and obtained 


means; and (B) intercorrelation of the meas- 
ures by rank-order correlation coefficients. 

The first method involved the use of “A 
Regression Technique for Matching Groups” 
found in Peters and Van Voorhis (9, pp. 463- 
469). On the basis of a regression equation 
developed from the C group, predicted scores 
are calculated for the trainees. The means of 
the predicted scores and the obtained scores 
are then tested by Student’s t. In many respects 
this is a covariance technique similar to 
Fisher’s. 
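As a rough illustration of the regression technique described above, the sketch below fits a least-squares line to control-group pretest and post test scores, uses it to predict post test scores for the trainees, and compares predicted with obtained scores by Student's t on the paired differences. The single-predictor form, the paired-difference standard error, the function names, and the scores themselves are assumptions made for illustration; the exact Peters and Van Voorhis computation may differ.

```python
import math
from statistics import mean, stdev


def fit_line(x, y):
    """Least-squares slope and intercept for y regressed on x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def predicted_vs_obtained(c_pre, c_post, e_pre, e_post):
    """Predict trainee post test scores from the control-group regression,
    then test the mean obtained-minus-predicted difference with Student's t."""
    slope, intercept = fit_line(c_pre, c_post)
    predicted = [intercept + slope * x for x in e_pre]
    diffs = [obtained - pred for obtained, pred in zip(e_post, predicted)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return mean(predicted), mean(e_post), t, n - 1


# Hypothetical scores (not taken from the study).
c_pre = [10, 12, 13, 15, 16, 18]
c_post = [11, 12, 14, 15, 17, 18]
e_pre = [11, 12, 14, 15, 17, 18]
e_post = [14, 16, 17, 19, 21, 21]

pred_mean, obt_mean, t, df = predicted_vs_obtained(c_pre, c_post, e_pre, e_post)
print(f"predicted mean {pred_mean:.2f}, obtained mean {obt_mean:.2f}, "
      f"t = {t:.2f}, df = {df}")
```

With 18 trainees this arrangement would give 17 degrees of freedom, which agrees with the n = 17 reported for the study.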

Differences significant at the 10 per cent level were accepted.
This is in the face of various recommendations that higher odds be
demanded with a small number of cases. While not rigidly demanding
or highly significant, this level was used because of (a) the
exploratory nature of the project, (b) the technical nature of the
course, and (c) the nature of the trainees. The trained supervisors
were not expected to become highly proficient in such a short period
of training time.

The second analysis, rank-order correlation, 
was used to determine if the pattern of inter- 
correlations revealed any enlightening factors. 
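The rank-order coefficient used for the second analysis can be sketched as follows. This is a minimal illustration, assuming the usual definition of rho as the product-moment correlation of the ranks, with tied scores given their average rank; the function names and sample scores are hypothetical.

```python
def ranks(values):
    """Rank values from 1 (lowest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def rho(x, y):
    """Rank-order correlation: product-moment r computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical scores of five supervisors on two measures.
test_a = [12, 15, 19, 22, 30]
test_b = [41, 38, 52, 47, 60]
print(round(rho(test_a, test_b), 2))
```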


Results 


Table 1 contains a summary showing means 
and standard deviations on each test for the E 
and C groups and the E group predicted means. 
In this table the trained group obtained means 
on all tests which were better than the means 
predicted on the basis of the test performance 
of the untrained group. 

Table 2 presents a summary of the results 
according to probability and odds statements. 
It will be seen that seven of the twelve measures 
have odds which were greater than the required 
level. These results, coupled with mean im- 
provement on each test, indicated the course 
reliably provided alterations in trainees’ re- 
sponses in the direction of improvement. 

Table 3 shows the rank-order correlation co- 
efficients between the pretest and post test 
measures and the department heads’ criterion 
ranking of the members of the experimental 
group. Several tests were significantly re- 

2 To reduce printing costs Tables 3, 4A, 4B, 4C, and 
4D have been deposited with the American Documentation 
Institute. Order Document 2906 from American 
Documentation Institute, 1719 N Street, N. W., Wash- 
Table 1 


Pretest and Post Test Means and Standard Deviations for Experimental and 
Control Groups (N = 18) 








Control Group 


Test Mean 


Experimental Group 


Predicted 
Mean 


Obtained 


S.D. Mean S.D. 





13.1 
13.9 


. General Facts and Principles 


65.8 
66.5 


. How Supervise? 


14.8 
15.6 


. H.S.? Supervisory Practices 


21.6 
22.2 


. H.S.? Company Policies 


29.4 
28.8 


15.3 
16.3 


95.0 
99.2 


. H.S.? Supervisory Opinions 


. General Logical Reasoning 


. Social Judgment Test 


31.8 
30.3 


. Supervisory Questionnaire 


. Section Error** 121.9 


123.5 


111.6 
115.8 


. Department Error** 


. Section Bias*** 134.8 


Po 125.2 


. Department Bias*** Pr 110.4 
Po 95.6 


2.55 
2.70 14.1 


14.76 
12.98 


2.93 
2.31 


13.4 2.89 
17.6 3.14 


18.33 
16.09 


3.82 
2.61 


5.48 
5.01 


66.4 


67.0 714 


14.8 
16.1 


6.46 
5.80 


23.5 
25.6 


8.50 
7.63 


28.1 
29.7 


11.05 
10.70 


4.82 
3.95 


15.1 


4.12 17.1 


17.54 
16.04 


84.6 
94.2 


30.7 4.99 


4.17 


48.90 
50.90 


41.83 


48.90 71.4 





* Note: Pr = Pretest and Po = Post test. 
** Note: In the “Error” scores, the lower scores are the better, i.e., less error. 
*** Note: In the “Bias” scores, the higher scores are the better, i.e., less underestimation of the goodness of 
morale. A constant of 140 was added to each bias score to make each positive (easing statistical handling). All 
means indicate an underestimation of the goodness of morale, inasmuch as they would be negative but for the 
constant added. 


ington 6, D. C., remitting $0.50 for microfilm (images 
1 inch high on standard 35 mm. motion picture film) or 
$0.50 for photocopies (6 × 8 inches) readable without 
optical aid. 

lated to the criterion ranking, both before and 
after training, but rather large shifts occurred 
only in two notable cases, the Supervisory 
Questionnaire and the Section Error score. 
After training, fairly high correlations were obtained 
between the criterion and the following 
tests: General Facts, 0.44; How Supervise? 
(Total), 0.53; H.S.? “Supervisory Practices,” 
0.48; H.S.? “Company Policies,” 0.66; H.S.? 
“Supervisory Opinions,” 0.44; General Logical 
Reasoning, 0.68; and Supervisory Questionnaire, 
0.41 (shift from -0.06 pretraining coefficient). 
The TAEGO “Section Error” coefficient was 
0.50 before training, and 0.22 after training. 
The overall effect, however, was an increase in 
correlation after training between the tests and 
criterion, indicating the training apparently 
brought the trainees more in line with the 
bases used for ranking by the department 
heads. 

Tables 4A, 4B, 4C, and 4D (deposited in 
ADI), show the intercorrelation matrices ob- 
tained for the control and experimental group 
pretest and post test measures. Three sets of 
facts were determined from these matrices. 

First, there was a large increase in the number 
of significant coefficients (i.e., 0.43 is significant 
at the 5% level where n = 17) (3) in the 
experimental results whereas there was a large 
decrease in the number of significant control 
coefficients. The E group had 16 significant 
coefficients on the pretest and 31 on the post 
test; the C group had 22 on the pretest and 12 
on the post test. The training appeared to 
have been responsible for bringing each trainee 
to a similar rank-order position. 


Second, there were differential changes in 
the test standard deviations for the two 
groups. The test scores were dispersed less on 
the post training tests in the E group than on 
the pretraining tests (2 post test S.D.’s were 
larger than pretest S.D.’s while 10 were smaller). 
This was not true for the C group (8 post test 
S.D.’s were larger than pretest S.D.’s while 4 
were smaller). Thus it appeared that the 
trainees became more “alike” as a result of 
training. 

Third, the average intercorrelation coeffici- 
ents of the ranks were computed for the pre- 
and post test matrices. For the E group the 
pretest coefficient was .24 and the post test was 
.30; for the C group the pretest coefficient was 
.10 and the post test was .00. Thus, it ap- 
peared that the operation of chance factors was 
reduced in the trained group, but increased in 
the untrained group. Knowledge of psychology 
was apparently increased in the 
trainees by means of general principles and 
facts while the C supervisors did not gain. 
However, the initial disparity could not be 
explained. 


Table 2 

Summary of Results in Terms of Probability and Confidence Statements 

Columns: (a) Ratio of Difference between Predicted and Obtained Mean to
Its Standard Error; (b) Probability (Student’s “t,” N = 18, n = 17);
(c) Odds (that the differences would be greater than zero in the same
direction with other samples)

    Measure                                      (a)     (b)      (c)
 1. General Facts and Principles                 4.84    .0001    9999 to 1*
 2. How Supervise? (Total)                               .0411    23.3 to 1*
 3. How Supervise? Supervisory Practices                 .2039    3.9 to 1
 4. How Supervise? Company Policies                      .0465    20.5 to 1*
 5. How Supervise? Supervisory Opinions                           12.2 to 1*
 6. General Logical Reasoning                            .1805    4.5 to 1
 7. Social Judgment Test for Supervisors                          2.3 to 1
 8. Supervisory Questionnaire                                     14.6 to 1*
    Test of Ability to Estimate Group Opinion:
 9. Section Error                                0.26    .3921    1.5 to 1
10. Department Error                             2.09    .0270    36.0 to 1*
11. Section Bias                                 0.79    .2204    3.5 to 1
12. Department Bias                              2.13    .0241    40.5 to 1*

* The asterisk indicates that the odds were as great or greater than the
10 to 1 odds established for acceptability by the writer. 
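The odds printed in Table 2 appear to follow from the probabilities as (1 - p)/p. That relation is inferred here from the printed figures and is not stated in the text, so the short sketch below (with a hypothetical function name) should be read as a check on the table rather than as the writer's own procedure.

```python
def odds_in_favor(p):
    """Odds (to 1) corresponding to a one-tailed probability p."""
    return (1 - p) / p


# Probabilities as printed in Table 2 for four of the measures.
for p in (0.0411, 0.0465, 0.0270, 0.0241):
    print(f"p = {p:.4f}   odds = {odds_in_favor(p):.1f} to 1")
```

Under this reading, the 10 to 1 criterion corresponds to a probability of 1/11, or about .09, which is close to the 10 per cent level adopted for the study.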

Discussion 

A problem occurs in reporting and discussing 
conclusions drawn from an investigation of this 
type wherein there seems to be no prior sys- 
temized body of knowledge. The investigator 
does not have many previous research conclusions, 
either confirmatory or contradictory, 
which will enable comparison with those ob- 
tained. Hence, the major portion of this dis- 
cussion must be related primarily to the ob- 
jectives established. 

The first course objective was considered ade- 
quately met. Significant gains were found on 
the General Facts and Principles and How 
Supervise? tests; higher post test correlations 
were found between measures known to be re- 
lated in content, especially the intercorrelations 
of General Facts and Principles, the “Super- 
visory Practices” and “Supervisory Opinions” 
sections of How Supervise?, and the Super- 
visory Questionnaire; and a higher average 
intercorrelation coefficient was found for the 
experimental post test scores. These various 
lines of evidence indicated that not only was 
knowledge of psychological facts and principles 
increased, but the trainees seemed to have applied 
the knowledge generally, as indicated 
especially by the higher intercorrelations. In 
this connection it must be remembered that 
gains greater than predicted were found on all 
measures. 

The second objective was considered parti- 
ally met. Significant gains were found on 
How Supervise?, Supervisory Questionnaire, 
“Department Error” and “Department Bias” 
scores of the TAEGO; higher correlations 
among these tests were found also. Since 
significant gains were not found on all tests 
related to this objective (especially General 
Logical Reasoning, Social Judgment Test, and 
the “Section” areas of the TAEGO), it was 
considered only partially met. But evidence 
was found that the trainees were more sensitized 
to behavioral acts, expressions of attitudes, 
and group differences. The clearest 
evidence was found in the Supervisory Questionnaire 
items and the TAEGO (“Department” scores). 
On these measures respectively 
the trainees revealed greater understanding 
and insight into employee behaviors and 
greater accuracy and less bias in judging 
opinions held by employees. 

The third objective was considered as parti- 
ally met. Significant gains were found in the 
General Facts and Principles test and the Super- 
visory Questionnaire. The higher post test 
average intercorrelation coefficient for the E 
group also indicated the training enabled the 
trainees to react to the tests in a systematic 
manner. Since this objective was stated as 
involving the use of personality concepts as 
integrating and unifying bases for learning, it 
would appear that some success was attained, 
but not as much as desired. 

Aside from the objectives, certain other con- 
clusions were drawn: 


1. In terms of changes in the trainees’ test 
performance the course was found to be of 
value for these companies’ supervisors. 

2. The trained supervisors became more 
similar in abilities measured by the test battery 
as shown by reduced standard deviations and 
higher intercorrelations of a majority of the 
measures. 

3. The trained supervisors became more ac- 
curate in estimating the opinions of employees 
in their department. This was not found to 
be true for section estimates, however. Perhaps 
the group meetings could account for this 
fact (since all trainees were drawn from one 
department and had the opportunity of exchanging 
observations). Generally, the supervisors 
underestimated the morale level, but 
not as much after training. 

4. The trained supervisors agreed more 
closely with personnel and training specialists 
on the nature of valuable supervisory and com- 
pany employee relations policies and proced- 
ures (How Supervise? results). 

5. The “Supervisory Practices” section of 
How Supervise? was found to be significantly 
related to all but two other tests in the experi- 
mental post training results. Since all tests 
were selected as related to supervisory prac- 
tices, and this section purported to measure the 
same, it was concluded that the course content 
had been related to supervisory practices by 
some systematic set of principles. 
6. The Supervisory Questionnaire indicated 
that the trainees were apparently enabled to 
make more adequate and qualitatively better 
responses to “open-ended” questions concern- 
ing supervisory problems. In addition, this 
measure correlated significantly with a major- 
ity of the other tests, again indicating a common 
set of systematic principles. 

7. The Social Judgment Test for Supervisors 
was found to be negatively correlated with a 
majority of other measures. Magnitude of the 
negative correlations increased after training. 
It was thought that the “brighter” trainees 
probably reacted to very slight clues given in 
the descriptions of the people involved in the 
problem situations. This appeared to score 
against them according to the key. Further, 
there was slight evidence that the “better” 
trainees saw through the projective hypothe- 
sis. In either case, the “better” trainees 
seemed to be in conflict about this test, which 
presumably yielded them poorer scores. 

8. The General Logical Reasoning test was 
found to be highly correlated with several other 
pretest measures. However, the magnitude of 
the correlations increased on the post test, in- 
dicating that the training presumably enabled 
greater application of logical reasoning to the 
questions and situations. 

9. By comparing obtained scores against 
predicted scores for the top 27% and the bot- 
tom 27% of the trainees, it was found that 
those holding highest scores initially gained 
the most on all measures except the TAEGO. 
In this case those scoring lowest initially gained 
the most. 

10. A morale survey conducted among the 
employees in the E and C supervisors’ departments, 
after the training period, revealed improvements 
in mean section morale scores, but 
the gains were about equal in each case. The 
morale was quite high initially, however, which 
might have accounted for the lack of any improvement 
in the experimental department 
over the control (13). 


Limitations 
Several limitations to this general project 
must be noted. 
1. It was suggested by several observers 
within the companies that some of the most 
striking changes appeared in the form of more 


adequate personality adjustments of the course 
participants. No adequate estimates of such 
changes and their reliabilities were obtained. 

2. Certain analyses of human relations training 
needs were thought necessary but could not 
be undertaken in this study. 

3. The training methods used were not ade- 
quately studied. Other methods might have 
been more effective. 

4. One of the more crucial criteria would 
have been analytical studies of changes in 
interpersonal relations among employees and 
supervisors, particularly any reduction in con- 
flict, changes in social structures, patterns of 
group behaviors, etc. Such evaluations could 
not be attempted. 


Suggestions for Future Research 


1. An analysis of the individually trained 
supervisor’s “social insight” or “awareness” 
seems necessary. The technique probably 
would involve pre- and post training “open- 
ended” interviews, with coding and rating of 
content. 

2. Relationships and group structures, 
through a sociometric analysis by a trained 
interviewer or observer, should be more 
thoroughly studied before and after training. 

3. An extension of indirect attitude assess- 
ment instruments similar to the TAEGO seems 
desirable. This may require the use of various 
batteries of questions varying with the situ- 
ation. 

4. Additional complementary courses appear 
desirable. In the writer’s opinion, a 
valuable course for supervisors would be 
“Sociology and Human Relations” (as this 
might have been “Psychology and Human 
Relations”). 

5. Human relations training courses should 
probably be given to various levels of an organ- 
ization above the supervisory level and studies 
of the effects made. Just where to center the 
training for optimal effects is unknown at 
present. 

6. Studies of the possible changes in organ- 
ization communication as a result of human 
relations training are needed. Effective com- 
munication appears to be a function of motiva- 
tion for communication. Such training may 
increase the motivation. 
7. Lastly, this study is thought to be best 
regarded as propaedeutic for future research, 
especially in the area of critical tests and meas- 
ures. A great deal more research is required 
before the optimal nature of human relations 
courses and training can be specified. 


Summary 


An exploratory study in the area of super- 
visory human relations training was conducted 
in a large insurance organization. Two pur- 
poses were: (A) development and experimental 
evaluation of a human relations training course, 
and (B) evaluation of certain instruments se- 
lected as criterial measures. The course was 
presented to all supervisors of one department. 
A comparable group served as control. The 
measures were administered before and after 
training. Statistical analyses included the use 
of Peters and Van Voorhis’ “A Regression 
Technique for Matching Groups” and inter- 
correlating the measures using rho. 

The major results indicated that the trained 
supervisors obtained means on all tests better 
than the means predicted from the test per- 
formance of the untrained group; on a majority 
of the measures the improvement was found to 
be statistically significant; by means of the 
intercorrelation changes, the trained super- 
visors became more similar in the abilities 
measured; they became more able to estimate 
opinions held by department employees; they 
agreed more closely with experts on the nature 
of valuable supervisory and company employee 
relations policies and practices; they appeared 
to be more able to apply logical reasoning; 
those holding the highest scores initially gained 
the most on a majority of the measures; a 
morale survey indicated improvement in em- 
ployee morale scores for both the experimental 
and control departments even though morale 
was quite high initially. In all, the study 
indicated that such a course could be considered 
valuable for supervisors in these companies. 


Received April 10, 1950. 
Methods for Scoring a Check-List Type Rating Scale 


Herbert H. Meyer 
The Detroit Edison Company * 


One of the most difficult tasks that the per- 
sonnel man must undertake is that of devising 
rating techniques by which the performance of 
workers can be evaluated with as much pre- 
cision as possible. One method for dealing 
with this problem, which is being mentioned 
with increasing frequency in personnel publi- 
cations, is the weighted check-list type rating 
scale. 

This technique has many advantages over 
the more commonly used numerical or graphic 
type rating scales. Probably the chief ad- 
vantage lies in the fact that differences in the 
raters’ interpretations of a characteristic are 
reduced. The rater, presented with a list of 
concrete descriptions of on-the-job behavior, 
is asked only to indicate which of the statements 
apply to the ratee under consideration. 

The method would probably be used much 
more widely than it is at present if it were not 
for the fact that the conventional procedure 
for constructing and scoring such a scale is 
quite laborious. The steps in this procedure 
are, first, to compile a number of statements 
describing the behavior of a worker in per- 
forming the particular aspect of his job that is 
to be rated. Secondly, a large number of per- 
sons who are familiar with the job are asked to 
judge the importance of each of the described 
behaviors to success on the job. Or, to put it 
another way, they judge the degree to which 
each statement is characteristic of a good or 
poor employee. The method of equal appear- 
ing intervals is usually used by the judges. 
That is, they sort the statements into a speci- 
fied number of piles (usually 11) which repre- 
sent equal-appearing steps on a scale from good 
to poor performance. Thirdly, the median 
pile number into which each statement is 
placed by the judges is computed as the scale 
value for that item. Also, the quartile deviation 
(Q) value of the judgments for each statement 
is computed as an index of ambiguity. 


* Dr. Meyer has recently joined the staff of The 
Psychological Corporation in New York. 
The final scale is constructed by selecting from 
the items with satisfactorily small Q values 
those with scale values which define the range 
in approximately equal steps. 
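The construction steps above can be sketched in a few lines. The example below computes, for each statement, the median pile number (its scale value) and the quartile deviation Q, and keeps the statements whose Q is small. The judgments, the Q cutoff of 1.5, and the function names are hypothetical, and the quartiles follow the default method of Python's statistics.quantiles, which may differ slightly from a hand computation.

```python
from statistics import median, quantiles


def scale_value_and_q(piles):
    """Median pile number and quartile deviation Q for one statement."""
    q1, _, q3 = quantiles(piles, n=4)
    return median(piles), (q3 - q1) / 2


def select_items(judgments, max_q=1.5):
    """Keep statements whose judgments agree closely (Q at or below max_q).

    max_q is an assumed cutoff, not one given in the article."""
    kept = {}
    for statement, piles in judgments.items():
        value, q = scale_value_and_q(piles)
        if q <= max_q:
            kept[statement] = value
    return kept


# Hypothetical pile numbers (1 = poor, 11 = good) assigned by seven judges.
judgments = {
    "Goes out of his way to help others": [10, 11, 10, 9, 10, 11, 10],
    "Keeps to himself": [2, 9, 5, 11, 1, 7, 6],
}
print(select_items(judgments))
```

A final scale would then be assembled from the retained statements so that their scale values cover the range in roughly equal steps.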

The scale is conventionally scored by com- 
puting the median scale value of the items in- 
dorsed by the rater. However, Jurgensen, in 
a recent article (2), points out a fallacy in the 
scoring of a check-list scale by this method. 
He shows that, although all the statements with 
scale values above the mid-point on the scale 
usually describe desirable behavior, the person 
for whom all of these statements have been 
checked may be penalized. That is, a person 
for whom only two or three statements with 
very high scale values are indorsed will get a 
much higher score by the median scale value 
method of scoring than the person for whom 
all statements above the mid-point on the 
scale are indorsed. 

It is true that the statements with very high 
scale values represent more important and 
desirable behavior than the others. Never- 
theless, the implication is that the other de- 
sirable behaviors, not indorsed, do not apply 
to the person who scores highest by this method. 
This could be obviated by restricting the num- 
ber of statements to be indorsed. That is, for 
example, asking the rater to check the six 
statements which best describe the person to 
be rated. However, Jurgensen recommends 
a more satisfactory way of dealing with this 
fallacy of scoring. He suggests that the rating 
scale be scored by computing the sum of the 
deviations of the scale values from the mid- 
point of the scale. 
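The fallacy Jurgensen describes, and the effect of his proposed remedy, can be seen in a small worked example. The scale values and the mid-point of 6 (for an 11-pile scale) are assumed for illustration only.

```python
from statistics import median

MIDPOINT = 6  # assumed mid-point of an 11-pile scale


def median_score(values):
    """Conventional score: median scale value of the items indorsed."""
    return median(values)


def deviation_score(values, midpoint=MIDPOINT):
    """Jurgensen's score: sum of deviations of scale values from the mid-point."""
    return sum(v - midpoint for v in values)


# Hypothetical scale values of the statements checked for two ratees.
few_high = [10.5, 10.8]  # only the two most favorable statements
all_favorable = [6.5, 7.2, 8.0, 8.9, 9.6, 10.5, 10.8]  # every favorable statement

for label, checked in (("few high", few_high), ("all favorable", all_favorable)):
    print(label, median_score(checked), round(deviation_score(checked), 1))
```

By the median method the ratee with only two statements checked scores higher; by the sum of deviations the ratee credited with every favorable statement scores higher.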


The Problem 


In a recent study to develop tests to measure 
“human relations” ability in supervision, a 
weighted check-list rating scale was used as one 
criterion measure. It was scored by computing 
the median scale value of the statements indorsed. 
The question then arose of to what 
degree the results would change if the scale 
were scored as Jurgensen suggests, that is, by 
computing the deviation of the scale values of 
the items indorsed from the mid-point of the 
scale. It was then decided to determine also 
if the scale could be scored by a simpler method 
with approximately the same results. 

On the basis of an item analysis of the results 
it appeared that it might be possible to score 
the rating scale with little loss in precision by 
using only three different weights. By this 
system all desirable or favorable statements, 
that is, statements with scale values above the 
mid-point on the scale, would be scored the 
same. Each of these items indorsed by the 
rater would give one point credit. The unfavorable 
statements were divided into two 
groups on the basis of their ability to discriminate 
between extreme groups on total score. 
Those with weights near the middle of the 
scale were scored minus one, and those more 
extreme in unfavorableness minus two. 

Even this system might be simplified further 
by merely scoring all favorable statements 
plus one and all unfavorable statements minus 
one. As another alternative, perhaps it would 
be necessary only to score the scale by counting 
the number of favorable statements indorsed. 
Or, since a “generosity” error is often present 
in rating, perhaps it would be even better to 
score it by counting the number of unfavorable 
statements indorsed. For example, it was 
found in this study that a few raters would 
indicate that some of the unfavorable state- 
ments applied as a description of the ratee; 
but yet they would indorse most of the favor- 
able statements also, some of which might be 
contradictory to the unfavorable statements 
checked. 

It was decided to determine the reliability or 
stability of each of these methods of scoring 
the rating scale and the relationship between 
scores derived by each scoring method. 


Procedure 


Ratings by two different raters on each of 
100 persons were selected at random from all 
of the ratings obtained in the supervisory study. 
This provided a total of 200 completed rating 
forms. Each of these forms was scored by six 
different methods. These were labeled for 
convenience as follows: 


1. Median Scale Value. The median value 
of the weights or scale values of the items in- 
dorsed. 

2. Sum of Deviations. The sum of the deviations 
of the weights for the items indorsed 
from the mid-point on the scale (the procedure 
that Jurgensen recommends). 

3. Three Weight. Scoring all favorable 
statements indorsed plus one, and unfavorable 
statements either minus one or minus two, 
depending on their ability to discriminate. 

4. Two Weight. Scoring all favorable state- 
ments indorsed plus one, and unfavorable 
statements minus one. 

5. Unit Weight, Favorable. Score equals 
the number of favorable statements indorsed. 

6. Unit Weight, Unfavorable. Score equals 
the number of unfavorable statements indorsed. 
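The six scoring methods listed above can be set side by side in code. In this sketch the item scale values, the mid-point of 6.0, and the flag marking which unfavorable items take the weight of minus two are all hypothetical; the article assigned the heavier weight on the basis of its own item analysis.

```python
from statistics import median

MIDPOINT = 6.0  # assumed mid-point of the scale

# Hypothetical items: (scale value, extreme-unfavorable flag).
# The flag marks unfavorable items scored -2 under the Three Weight method.
ITEMS = {
    "a": (10.2, False),
    "b": (8.4, False),
    "c": (7.1, False),
    "d": (4.8, False),
    "e": (2.9, True),
    "f": (1.3, True),
}


def score_all(checked):
    """Score one completed rating form by each of the six methods."""
    values = [ITEMS[i][0] for i in checked]
    favorable = [i for i in checked if ITEMS[i][0] > MIDPOINT]
    unfavorable = [i for i in checked if ITEMS[i][0] <= MIDPOINT]
    extreme = sum(1 for i in unfavorable if ITEMS[i][1])
    return {
        "median_scale_value": median(values),
        "sum_of_deviations": sum(v - MIDPOINT for v in values),
        # +1 per favorable, -1 per unfavorable, one further -1 if extreme.
        "three_weight": len(favorable) - len(unfavorable) - extreme,
        "two_weight": len(favorable) - len(unfavorable),
        "unit_weight_favorable": len(favorable),
        "unit_weight_unfavorable": len(unfavorable),
    }


print(score_all(["a", "b", "d", "e"]))
```

Note that the last method counts unfavorable statements, so a higher score on it indicates a poorer rating.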


Product moment correlation coefficients 
were computed between the scores derived by 
each of these methods. In addition, the rating 
scores derived by each method for the 100 su- 
pervisors by half of the judges were correlated 
with scores by the other half of the judges. 
This is a commonly used method for determin- 
ing the reliability of ratings obtained. It was 
felt that this might give an indication of the 
reliability or stability of the scores or indices 
derived by each of the different methods of 
scoring the rating scale. 


Results 


Table 1 shows that in general there is a high 
correlation between the different methods of 
scoring the rating scale. These relationships 
drop off appreciably with the last two methods, 
where only half the items are considered in the 
scoring. Of the first four scoring methods the 
reliability coefficient for the scores computed 
by the median scale value method is highest. 
However, the differences between this coefficient 
and the other three of that group are not 
significant. The differences between the reliability 
of the scores by the median value 
method and the unit weight methods are significant 
at the 1 per cent level of confidence.1 
The differences between the reliability of the 
scores by the other three methods (in which all 
the items are used in the scoring) and the unit 
weight methods are significant at approxi- 


1 The correlation between the different arrays was 
taken into account in computing the significance of 
differences between the correlation coefficients (3, p. 
Table 1 


Reliabilities and Intercorrelations for Different Methods of Scoring the Check-List 
Rating Scale (N = 200) 








Sum of 
Deviations 


Median Scale 
Value 


Three Weight TwoWeight Unit Weight, Unit Weight, 
(+1, —1, —2) 


(+1, —1) Favorable Unfavorable 





Median Scale 
Value 


i a 
(.87) 


94 70* 


(82) 
98 


Sum of 
Deviations 


Three Weight 95 
(+1, —1, —2) 

Two Weight 93 97 
(+1, —1) 

Unit Weight, 82 92 
Favorable 

Unit Weight, 92 
Unfavorable 


90 


85 89 


95 91 


97 69* 


.60* 
(.75) 

9 “> 
(.72) 





* The starred coefficients in the diagonal present the correlations between the scores derived by the corre- 
sponding method for the ratings assigned the supervisors by one half the judges with the ratings of the other half 
of the judges. The figures in the parentheses present the estimated reliability coefficients as conventionally computed 
by applying the Spearman-Brown prophecy formula. 
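For reference, the Spearman-Brown step-up mentioned in the table note can be written out directly. The sketch below uses the standard form of the formula; the function name is hypothetical, and the two input values are starred coefficients that survive legibly in the table.

```python
def spearman_brown(r_half, k=2):
    """Estimated reliability when the number of raters (or items) is multiplied by k."""
    return k * r_half / (1 + (k - 1) * r_half)


# Correlations between the two halves of the judges, as printed in Table 1.
for r in (0.70, 0.60):
    print(f"r = {r:.2f}  estimated reliability = {spearman_brown(r):.2f}")
```

A half-judges correlation of .60 steps up to .75, which matches the parenthesized figure printed beside it.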


mately the 5 per cent level of confidence or 
better. 


Discussion 


Theoretically a weighted check-list rating 
scale can be scored with greatest precision by 
taking into account the degree of favorableness 
or importance of each statement in the scale. 
Conventionally, this has been done by com- 
puting the median value of the weights for the 
items indorsed by the rater. Logically it is 
superior to consider these weights in scoring the 
scale by computing the sum of the deviations 
of the weights of the items from the mid-point 
of the scale. However, in this study it was 
found that practically the same results could
be achieved by scoring the rating scale by 
merely subtracting the number of statements 
with an unfavorable connotation indorsed by 
the rater from the number of favorable state- 
ments indorsed. The loss in precision of scor- 
ing by using still simpler methods, for example, 
counting only the favorable statements in- 
dorsed, is probably greater than the slight gain 
in ease of scoring would justify. 
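
The three scoring ideas compared in this discussion can be sketched in a few lines. This is a minimal illustration and not the authors' procedure: the scale values, the mid-point of 5.0, and the convention that values above the mid-point are favorable are all assumptions made for the example.

```python
from statistics import median


def median_scale_value(weights):
    """Median of the scale values of the statements the rater indorsed."""
    return median(weights)


def sum_of_deviations(weights, midpoint):
    """Sum of deviations of the indorsed scale values from the mid-point."""
    return sum(w - midpoint for w in weights)


def favorable_minus_unfavorable(weights, midpoint):
    """Count of favorable statements indorsed less the count of unfavorable ones."""
    favorable = sum(1 for w in weights if w > midpoint)
    unfavorable = sum(1 for w in weights if w < midpoint)
    return favorable - unfavorable


# Hypothetical scale values of the statements indorsed for one ratee.
indorsed = [8.2, 7.5, 6.9, 3.1]
MIDPOINT = 5.0  # assumed mid-point of the scale

print(median_scale_value(indorsed))
print(round(sum_of_deviations(indorsed, MIDPOINT), 1))
print(favorable_minus_unfavorable(indorsed, MIDPOINT))
```

The last function needs only the sign of each statement, which is why the simpler method would not require differential weights from a large judging group.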

If this result is found to be generally true in 
other check-list scales, it will have an important
implication not only for simplifying the scoring
procedure, but also for simplifying consider- 
ably the process of constructing such a rating 
scale. If the only consideration for determin- 
ing the score of a statement is to be whether it 
describes behavior which is either favorable or 
unfavorable on that job, then many fewer 
judges could be used than are necessary if the 
statements are to be differentially weighted to 
a more exact degree. Items on which there is 
any disagreement as to whether the behavior 
described is favorable or unfavorable could be 
eliminated. 

One criticism of constructing a scale in this 
manner might be that it would tend to elimi- 
nate items which would have scale values near 
the middle of the scale. However, there is 
evidence to indicate that this would not be a 
significant loss. Edwards demonstrated that 
in attitude scales, constructed and administered 
in a manner similar to that used with the type 
of check-list rating scale being considered here, 
the neutral items are non-differentiating (1). 
An item analysis of the statements used in this 
check-list scale also showed that those state- 
ments with weights near the middle of the 
scale were non-differentiating. Applying an
approximation of Thurstone’s criterion of ir- 
relevance, it was found that neutral items were 
as likely to be indorsed for ratees with either 
high or low total scores as they were for those 
who scored near the middle of the scale. 

It may be, of course, that the neutral items 
serve an important function even though they 
contribute nothing to the scoring of a check- 
list scale. They may serve as buffers to help 
conceal the true purpose of the scale. For ex- 
ample, if all statements are clearly either 
favorable or unfavorable, raters may be more 
influenced by “halo,” which this rating method 
seeks to diminish. This remains to be demon- 
strated. 


Summary 


A check-list type rating scale to measure 
supervisory ability was scored by six different 
methods. These included two methods in
which precise differential weights, derived from
the judgments of a large group of persons, were
used. By the first of these methods, the
ratee’s score was derived by computing the
median value of the weights of the statements
indorsed by the rater. The second of these
was the logically superior method of computing
the sum of the deviations of the weights for the
items indorsed from the mid-point on the scale. 

It was found that the same results for all 
practical purposes could be achieved by a 
much simpler scoring method. By this method 
a ratee’s score would equal merely the number 
of statements indorsed that describe favorable 
behavior on the job less the number indorsed 
that describe unfavorable behavior. If a 
check-list scale is to be scored in this manner it 
also would simplify greatly the procedure for 
constructing the scale. It would eliminate 
both the necessity of the large judging group 
for assigning weights to the items and the 
statistical computations necessary to deter- 
mine the scale values and indices of ambiguity 
for the statements to be used in the scale. 


Received March 31, 1950. 
Norms for Strong’s Vocational Interest Tests


Edward K. Strong, Jr. 
Stanford University 


This article calls attention, first, to certain 
data regarding the Vocational Interest Test 
for Men and Women which have not previously
been available, and second, to a method
of calculating mean scores of blanks on a given 
scale without scoring the blanks on the scale. 
The procedure has been found particularly 
useful in the preliminary stages of developing a 
scale and in determining mean scores of various 
groups on a scale.

Many requests have been received for the 
means and standard deviations of our men-in-
general group (Pm) on the occupational interest
scales; similarly for the women-in-general
group (Pw). These data are now available.

The raw means and standard deviations for 
the criterion groups on their own scales have
been supplied with the scoring stencils. To 
facilitate use of these statistics for all scales 
the data are being published in two tables, one 
pertaining to men’s scales and the second per- 
taining to women’s scales. 

Table 1 of the January 1951 revision of the 
Manual for the Vocational Interest Blank for 
Men supplies the above information (2).
More specifically the table gives the means 
and sigmas for the criterion groups and the 
men-in-general group on each of 45 occupa- 
tional scales. These data are given as raw 
and standard scores, except that standard 
scores are not printed for criterion groups, 
since such standard means are always equal to 
50 with a standard deviation of 10. 

The raw zero score expressed as a standard 
score and the mean chance score and its stand- 
ard deviation are also given for each scale in 
this table. 

A similar table for women’s scales will be 
published in the next revision of the Manual 
for the Women’s Blank (3).¹

¹ To reduce printing costs, this table has been deposited
with the American Documentation Institute.
Order Document 3136 from American Documentation
Institute, 1719 N. St., N.W., Washington 6, D. C.,
remitting $1.00 for microfilm (images 1 inch high on
standard 15 mm. motion picture film) or $1.00 for
photocopies (6 x 8 inches) readable without optical aid.
Calculation of the Mean Score without
Scoring the Blanks 


For some time we have calculated mean 
scores directly from the weights attached to 
each item. Since one must have ascertained 
the tally of responses to each item in order to 
determine the weights, the data needed to cal- 
culate the mean score are already available. 
The procedure is as follows: 


Given an item with m responses to each of 
which are attached scoring weights, then a 
man’s score on the item is 


$$(W_1 \times R_1) + (W_2 \times R_2) + (W_3 \times R_3) + \text{etc.}$$


where W equals the weight and R equals the 
man’s response, a 1 if he marks that response, 
a 0 if he does not. 

The sum of such totals for all the items in the 
test is the man’s raw score. So far the pro- 
cedure is exactly that employed in scoring the 
man’s blank on the scale. 

Similarly, the mean score of n persons on a
single item is

$$\frac{1}{n}\left[(W_1 \times \text{number replying}) + (W_2 \times \text{number replying}) + \text{etc.}\right]$$


Thus if 80 physicians like “surgeon,” 14 are
indifferent, and 6 dislike the activity, then the
mean score on the item is

$$\frac{1}{100}\left[(4 \times 80) + (-1 \times 14) + (-4 \times 6)\right] \quad \text{or } 2.82$$


The sum of such totals for all the items in the 
test is the mean raw score of the group. 

By this procedure the mean score of a group 
can be obtained without scoring the individual 
blanks on the appropriate scale. All that is 
needed are the weights for the different re- 
sponses to each item and the tally of responses 
of the group. 
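
The procedure just described can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes each item is divided by its own tally, which matters only when incompletely marked blanks give different items different numbers of replies.

```python
def item_mean(weights, tallies):
    """Mean score on one item: weights times number replying, divided by n."""
    n = sum(tallies)
    return sum(w * t for w, t in zip(weights, tallies)) / n


def group_mean_raw_score(items):
    """Mean raw score of the group: the sum of the item means over all items."""
    return sum(item_mean(weights, tallies) for weights, tallies in items)


# The "surgeon" item from the text: weights for like, indifferent, dislike
# and the tally of 100 physicians.
surgeon = ([4, -1, -4], [80, 14, 6])
print(round(item_mean(*surgeon), 2))  # 2.82
```

No individual blank is scored at any point; only the item weights and the tallies enter the calculation.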

If the weights are directly proportioned to 
the responses of the criterion group, or in the
case of the Vocational Interest Test, to the 
differences in responses between men-in-general 
and the criterion group, the above procedure 
should give the same mean score as obtained by 
scoring the blanks. But if the weights repre- 
sent groupings of differences, for example, a 
weight of 0 represents differences of 0 to 5, a 
weight of 1 represents differences of 6 to 17, 
etc., then the above procedure will give a mean 
score which approximates the mean score ob- 
tained by scoring all the blanks. When there 
are 400 items in the test, the approximation is 
very close. 

Mean scores of criterion groups obtained by 
scoring the criterion blanks on their respective 
scales are given in column one of Table 1 in 
the Manual. Corresponding mean scores ob- 
tained by multiplying the weights by the num- 
ber responding are given in column three of 
the table. The average difference between 
these two sets of means for 36 men’s scales is 
2.3 raw score, with a range from 7.6 to —3.2. 
When the raw scores are converted to standard 
scores, the average difference is 0.46 with a 
range from 1.33 to —0.64. On only four 
among 44 men’s and women’s scales does the 
difference between the two means exceed one 
standard score, or 0.1 sigma, and on twenty- 
seven of the 44 scales the approximation is less 
than 0.05 sigma. 

Differences between the two procedures may 
be attributed: first, to errors in calculation, 
which we believe are of minor importance as 
all the work has been most carefully checked; 
second, to the assigning of weights to grouped 
differences, referred to above; and third, to the 
discrepancy between number of blanks tallied 
and the number of blanks scored. It has been 
our custom to tally incompletely marked blanks 
for all items to which responses were given. 
But such blanks cannot be scored. In con- 
sequence the tally of responses upon which the 
scale is based has never been exactly equal to 
the tally upon which the norms are based. 
Regardless of the cause of discrepancies be- 
tween mean scores based on scored blanks and 
on weights multiplied by responses, the differ- 
ences in the two procedures are too slight to 
cause any real difference in the interpretation 
of results. 


Standard Deviation of Men-in-General Scores 


Although mean scores can be obtained with- 
out scoring the blanks, no corresponding 
method has been discovered for determining 
the standard deviation of the distribution. 

Although we have long wished to have the 
mean and sigma of the men- (and women-) in- 
general group, it seemed too great a task to 
undertake the scoring of 4,746 blanks, which 
constitute the men-in-general group, and then 
to combine the scores. Reference to the pro- 
cedure (1, p. 711 f.) will make the difficulty 
clear. 

As a substitute for sigmas based on all the 
4,746 blanks we have resorted to a sample of 
500 blanks which we believe gives results that 
must approximate the true sigmas. 

In setting up the men-in-general group, 
which represents not the total male population 
but business and professional men, a quota was 
assigned to 38 occupations, approximating the 
proportion of each of these occupations in 
the total population (1, p. 712 f.). Thus a 
quota of 1 was assigned to architect, a quota of 
2 to dentist, a quota of 6 to engineers. The
total of these 38 quotas amounted to 106. 
Originally 114 blanks of architects were aver- 
aged and the averages used to represent the 
quota of 1 for architect, etc. In all the cases 
where an occupational scale existed, the blanks 
employed to represent that scale had approxi- 
mately a mean score of 50 and a standard 
deviation of 10 on their own scale, but not, of 
course, on other scales. In setting up the 
sample of 500 approximately five blanks were 
used to represent each quota of 1. Wherever 
the sub-group is represented by an interest 
scale, the members of the sub-group were se- 
lected so that their scores would average 50 
with a standard deviation of 10. Thus, if the 
quota was 1, the five blanks were selected as 
nearly as possible with scores on their own 
scale of 64, 57, 50, 43 and 36. If the quota was
2, the ten blanks had scores of 68, 61, 57, 54, 
52, 48, 46, 43, 39 and 32; etc. Where the sub- 
group is not represented by an interest scale, 
as for example, lumbering officials, the re- 
quisite number of blanks were taken at random 
from the file. 
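
As a quick check, the two sets of scores quoted above do come out close to a mean of 50 and a standard deviation of 10. The sketch assumes the standard deviation is computed with N in the denominator, which the text does not state.

```python
from statistics import mean, pstdev

# Scores on their own scale for the blanks chosen to represent each quota.
quota_1 = [64, 57, 50, 43, 36]
quota_2 = [68, 61, 57, 54, 52, 48, 46, 43, 39, 32]

for scores in (quota_1, quota_2):
    # pstdev divides by N (population form); this is an assumption.
    print(mean(scores), round(pstdev(scores), 1))
```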

Originally 4,590 of the 4,746 blanks could be 
scored on a scale and did average about 50 on 
their appropriate scale with a sigma of 10. With
the sample, 408 of the 500 blanks averaged 50 
with a sigma of 10 on their own scale. The 
remaining blanks in both groups, representing 
a quota of 32 of the total of 106, were selected 
at random. 

As a check upon norms based upon the 
sample of 500 we may compare the mean 
scores based on weights X responses and on the 
500 actual scores. Such data expressed in 
standard scores are given in columns five and 
six of Table 1 in the Manual. Means based 
on the 500 sample average .66 higher than 
means based on weights, the average difference 
being .97 with a range from 3.0 to —1.4. We 
would guess that the means based on weights 
are the more accurate of the two but the differ- 
ences are too small to have any appreciable 
effect upon calculations based on one and not 
the other. 

Standard deviations based on the 500 sample 
of men-in-general for all the occupational scales 
are given in column seven of the table. The 
average of 36 such sigmas is 12.4 which agrees 
very well with several estimates made by the 
writer in the past. But he is suspicious of the 
three sigmas on the banker, realtor and author 
scales which equal 10.0. One would expect 
that men-in-general would differ more on any 
scale than does the criterion group. 

Critical ratios of differences between means 
and percentages of overlapping of distributions 
are, of course, affected by size of standard de- 
viations. Broad interpretations of results 
are, however, little affected by differences as 
much as 2.0 in standard deviations which 
typically range between 8 and 15. The stand- 
ard deviations given in the tables for men, or 
women, in general, are the only data we have 
and we believe are sufficiently accurate to give 
significant conclusions. 


Zero Scores 


If raw scores are employed there is no need 
of reporting the zero score, as it is obvious. 
But if standard scores are used there is often
need to know what standard score equals the
raw zero score. That information is given in
column eight of the table (also in 1, p. 88), and 
can be readily calculated, of course, by using 
the formula for conversion of raw to standard 
scores (1, p. 65).
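
A minimal sketch of such a conversion follows. It assumes the usual linear form in which the criterion group has a standard mean of 50 and a standard deviation of 10, as stated earlier for the criterion groups; the exact formula is the one given in the reference cited, and the raw mean and sigma used here are hypothetical.

```python
def to_standard(raw, criterion_mean, criterion_sd):
    """Linear conversion of a raw score to a standard score (mean 50, SD 10)."""
    return 50.0 + 10.0 * (raw - criterion_mean) / criterion_sd


# Hypothetical scale: criterion group raw mean of 100 and raw sigma of 50.
# The raw zero score then corresponds to a standard score of 30.
print(to_standard(0, criterion_mean=100.0, criterion_sd=50.0))
```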


If there were only one occupational interest 
score in existence instead of many the chances 
are that raw scores would be employed and all 
scores would be interpreted in terms of the
zero score. A plus score of 40 would be interpreted
to mean that the person had the interests
of the occupations, a minus score, that he
did not have those interests. In many studies 
that have been reported, particularly where 
comparisons have been made between scores 
on the Vocational Interest Test and scores on 
aptitude or other interest tests, raw scores 
with the raw zero score separating plus and 
minus scores should have been considered in- 
stead of merely standard scores in which the 
zero score is lost sight of. 

One might suppose that the zero score would 
fall half way between the mean of the criterion 
and P, the point of reference, i.e., men-in-gen- 
eral group. Actually there are only three 
among 72 scales where P falls as far below zero 
as the criterion group scores fall above zero. 
These three are the revised women’s office 
worker scale (scores respectively of -48.9 and
31.3), the men’s printer scale (-44.5 and 44.4)
and the men’s musician scale (-55.4 and 53.8).
In 35 among 45 men’s scales the mean of Pm
is below zero, the average of all scales being
-15.3. The relationship is largely reversed
with women’s scales, for in 15 among 24 scales
the mean of Pw is higher than zero, the average
being 5.5.

The greater difference in raw scores between
means of men’s occupational groups and Pm
than between women’s occupations and Pw
suggests that men’s occupational groups differ
from all men more than women’s occupational
groups differ from all women and suggests
furthermore that men can be more easily differentiated
than women in vocational interests.
Since we do not know the relationship of Pm
to all men nor the relationship of Pw to all
women we must emphasize the word “suggest” 
in the above statements and not conclude that 
the statements are proved, although the 
writer’s opinion is that the last statement is 
correct. 


Chance Scores 


The mean chance score on a scale may be ob- 
tained in two different ways: first, in terms of 
the weights of the separate items, and second, 
by averaging the scores of a number of blanks
which have been marked on a chance basis. 

The chance score on any item equals the 
algebraic sum of the weights attached to the 
possible responses to the item divided by the 
number of possible responses. With the Vocational
Interest Test, the chance score equals
1/3 the algebraic sum of the three weights for
that item. Similarly the mean raw chance
score on a scale is 1/3 the algebraic sum of the
three weights for all the items constituting the
scale.
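
The first way of obtaining the chance score can be sketched directly from the weights. The function names and the three sets of item weights are hypothetical and serve only to show the arithmetic.

```python
def item_chance_score(weights):
    """Algebraic sum of an item's weights divided by the number of responses."""
    return sum(weights) / len(weights)


def scale_chance_score(items):
    """Mean raw chance score on a scale: the sum of the item chance scores."""
    return sum(item_chance_score(weights) for weights in items)


# Hypothetical weights (like, indifferent, dislike) for a three-item scale.
items = [[4, -1, -4], [2, 0, -1], [-3, 1, 3]]
print(round(scale_chance_score(items), 3))
```

With three responses per item this equals one third of the algebraic sum of all the weights on the scale.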

Chance scores so derived and converted to 
standard scores are given in column nine of 
Table 1 in the Manual. Chance scores based 
upon 40 blanks, the 400 items of which had been 
marked according to throw of dice, are not re- 
produced here (see 1, p. 88), but the standard 
deviations of such chance scores are given in 
column ten of the table. Mean chance scores 
derived in these two different ways differ on 
the average only .44 standard score, with a 
range of 2.24 to —.60. 

In general a heterogeneous group will have a 
mean occupational interest score approximating
the mean chance score. On the other hand,
96 per cent of a criterion occupational group
score above the mean chance score. Among
women’s scales the percentages range from 82
to 90 for business education teachers, stenographers,
housewives, and office workers.
Among the remaining nineteen scales the per- 
centages range from 97 to 100, and average 
98.8. 

If the sum of all the weights on a scale 
equalled zero then the raw chance score would 
equal the raw zero score. For some unknown 
reason there are a few more minus than plus 
weights on 57 of 64 men’s and women’s scales. 
As a result the mean chance score of 34 men’s 
scales averages 26.8 and the zero score averages 
29.0, a difference of 2.2. The corresponding 
averages for 27 women’s scales are 26.1 and 
27.9 with a difference of 1.8. Although the 
mean chance score is on the average two stand- 
ard scores less than the zero score, the differ- 
ence is small and for most purposes the chance 
score may be considered equivalent to the zero
score. 

The shaded area on the Report Form was 
introduced to call attention to the zero and 
chance scores. It has aided in the interpreting
of scores for counseling purposes. But too
many have ignored the significance of these 
scores in research work. 

Chance scores should be considered in re- 
porting the validity of a test, for a test should 
yield a satisfactory proportion of scores out- 
side the range of chance. In terms of this 
requirement we suspect that several person- 
ality tests, including interest tests, would not 
make a very good showing. 


Norms for the Scales on the Women’s Blank 


A similar table to that described above for 
the men’s blank will be published when the 
Manual for the women’s blank is revised. 
Only a few comments need be reported regard- 
ing the norms for women, as, for the most part, 
comments regarding men apply to women also. 

In determining the standard deviation of the 
scores of the women-in-general group on the 
various scales a sample of 135 blanks was 
utilized. The revised women-in-general group 
is composed of 21 occupational criterion groups 
represented by 7,047 blanks, 3 occupational 
groups for which there are no scales, and 3 
groups of high school and college students, 
represented by 772 blanks. The blanks in 
each group were averaged and the 27 averages 
were combined to give the women-in-general 
group (3, p. 12). For the sample of 135 blanks, 
five blanks were selected to represent each of 
the 27 groups. In the case of the 21 criterion 
groups the five blanks were selected so as to 
average 50 on their own scale with a standard
deviation of 10, as described above; in the case 
of the remaining 6 groups five blanks were 
selected at random. 

From the sample of 135 cases are derived 
means and sigmas of women-in-general for the 
occupational scales, expressed in standard 
scores; see columns six and seven, Table for 
Women’s Scales. Means based on the 135 
samples agree more closely with means based 
on item weights with the women’s scales than 
the corresponding data agree with the men’s 
scales. With the women the average deviation 
between the two means is .72, with the men it is 
.97. The mean based on the sample averages 
.48 higher than the mean based on weights in
the case of the women and it averages .66 
lower in the case of the men. 
There is no way to check the accuracy of the
sigmas for women-in-general given in column 
seven of the table. The average of all the 
standard deviations is 12.6, which is approxi- 
mately that obtained in connection with the 
men’s scales. 

Chance scores derived from 1/3 the algebraic
sum of the weights and converted to standard 
scores are given in the ninth column of the 
Table for Women’s Scales. Chance scores 
derived from 50 blanks which were marked by 
throw of dice are not reported. Mean chance 
scores derived in these two different ways 
differ on the average 0.58 standard score, with 
a range from 1.6 to —0.7. Chance scores 
based on dice average 0.3 higher than such 
scores based on weights. As there is very little 
difference between the two sets of mean scores, 
we are using those based on weights as they 
should more truly represent the mean chance 
score. As there is no way of calculating stand- 
ard deviations from the weights we supply the 
sigmas based on 50 scores for each scale from 
throw of dice. 


Validity of Occupational Scales 


Since an occupational interest scale was designed
to differentiate between the interests of
members of an occupation and men in general 
(P), the measure of validity of such a scale is 
the extent to which the two groups are differ- 
entiated. Validity of such scales when used 
for guidance of young people is another matter 
with which we are not concerned here. 

Two measures of validity are available: (1), 
the difference in mean scores between P and the 
occupational criterion group; and (2), the per- 
centage of overlapping of the total groups. 
The relationship between the two measures is 
slightly curvilinear. For one set of data the 
eta coefficient is .94 in contrast to the Pear- 
sonian r of .87. Similar conclusions, but not 
exactly the same conclusions, are accordingly 
obtained through the use of the two measures. 

Since per cent of total overlapping is more 
meaningful than mere differences between 
mean scores, we prefer to use the former. Such 
percentages are given in the last column of 
Table 1 in the Manual and also in Table 1 in 
this article. 

Occupational scales are listed in Table 1 on 


the basis of percentage of total overlapping, 
ranging from 17 to 53 per cent. The average 
overlapping is 31.4 for 36 men’s scales and 
35.2 for 23 women’s scales. When the differ- 
ence between the means of two groups has a 
critical ratio of 3 the percentage of overlapping 
is seventy. This statistical landmark makes 
clear that all the occupational groups are dif- 
ferentiated from P to a degree far beyond that 
required for statistical significance. Over- 
lapping of 30 to 35, the average for all the 
scales, is not too often encountered in studies 
of individual differences. Bear in mind we are 
employing total overlapping, not the percentage
that one group exceeds the median of a
second group, which is quite another matter. 
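
To give a feeling for the size of these percentages, total overlapping can be sketched for two normal distributions. The article does not spell out its computation, so the definition below (the shared area of two normal curves with a common standard deviation) is an assumption, and the means and sigma in the example are hypothetical.

```python
from math import erf, sqrt


def normal_cdf(x):
    """Cumulative distribution function of the standard normal curve."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def percent_overlap(mean_a, mean_b, sd):
    """Percentage of area shared by two normal curves with a common SD."""
    d = abs(mean_a - mean_b) / sd
    return 100.0 * 2.0 * normal_cdf(-d / 2.0)


# Hypothetical groups two standard deviations apart overlap by roughly
# 32 per cent; identical groups overlap by 100 per cent.
print(round(percent_overlap(50.0, 30.0, 10.0), 1))
print(round(percent_overlap(50.0, 50.0, 10.0), 1))
```

Under this assumed definition, an overlapping of 30 to 35 per cent corresponds to group means about two standard deviations apart.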

On the basis of overlapping the four poorest 
scales are all men’s scales, namely, president of 
a manufacturing company, production man- 
ager, farmer, and realtor, with overlapping of 
53 to 45 per cent. 

Differences in overlapping between occupa- 
tions and P must be explained on the basis of 
the composition of both groups. It seems likely 
that the more homogeneous an occupational 
group the less the overlapping with P. It also 
seems likely that the smaller the number of 
people employed in the occupation the less 
the overlapping will be. A better way of ex- 
pressing this statement would be to say, the 
smaller the number of people engaged in an oc- 
cupation and closely related occupations, the 
less the overlapping will be. This proposition 
might explain the very low overlapping in 
interests with P of men psychologists, mathe- 
maticians, and Y.M.C.A. workers but not of 
carpenters. These are the four occupations 
with overlapping of only 15-19 per cent. Lack 
of the necessary data makes it impossible to 
explore the validity of these two propositions 
at the present time. 

Pm was constructed to represent the interests
of business and professional men representative
of the upper socio-economic strata. Since
Pm is composed of twice as many business men
as professional men (1, p. 712), we should expect
greater overlapping between Pm and business
scales than with professional men’s scales.
This is the case. The average per cent of over- 
lapping for 13 business scales is 39.4 in contrast 
to 25.8 per cent for 11 professional scales 
(critical ratio of 4.8). 







Table 1

Percentage of Total Overlapping between Criterion and Pm or Pw

| Percentage | Men’s Scales | Women’s Scales |
|---|---|---|
| 15-19 | Psychologist (old), Mathematician, Carpenter | Y. W. secretary |
| 20-24 | Artist, Architect, Policeman, Forest service, Minister, Musician | |
| 25-29 | Dentist, Chemist, Aviator, Printer, Y. M. physical director, Y. M. secretary, Social science teacher, City school superintendent, C.P.A. (partner) | Psychologist, Life insurance saleswoman, Buyer |
| 30-34 | Physician (old), Math. science teacher, Personnel manager, Banker, Advertiser, Author | English teacher, Home economics teacher, Occupational therapist, Laboratory technician |
| 35-39 | Accountant, Lawyer, Public administrator | Artist, Author, Lawyer, Dietitian, Nurse, Math. science teacher, Elementary teacher, Physician, Stenographer (revised) |
| 40-44 | Engineer, Office man, Purchasing agent, Sales manager, Life insurance salesman | Librarian, Social worker, Social science teacher, Housewife, Dentist, Office worker (revised) |
| 45-49 | Production manager, Farmer, Realtor | |
| 50-54 | President of a manufacturing company | |
| Mean | 31.4 per cent | 35.1 per cent |
| Sigma | 9.3 | 6.3 |





Pm was not intended to represent the average
man in the United States who has a mean
OL score of 50. Consequently occupational
scales with mean OL scores below 55 should
overlap much less with Pm than business and
professional scales. This is true, for the average
overlapping of the five men’s scales of
aviator, musician, printer, policeman, and 
carpenter with OL mean scores ranging from 
54.3 to 48.5, is only 23.6 per cent in contrast 
with 32.7 per cent overlapping on the remain- 
ing 31 scales. 

Each of the women’s occupations listed in 
the accompanying Table 1, excepting housewife 
and office worker, is about equally represented 
in the women-in-general group (Pw). On this
basis the overlapping between Pw and the
women’s occupational scales should be similar. 
Reference to Table 1 makes clear that the 
women’s scales differ in this respect from one 
another appreciably less than is the case with 
the men’s scales. The standard deviation 
from the mean overlapping of 22 women’s
scales is 6.3 in contrast to 9.3 for the 36 men’s
scales. The difference between the two sig- 
mas is 3.0 with a critical ratio of 2.1. 

The housewife and office worker group are 
not represented in Pw. They were omitted
because such interests correlate very highly
with the interests of stenographers, primary
education teachers, and nurses, which were included
in Pw and it seemed desirable not to
overweight this type of interest in Pw. Nevertheless,
these two occupations are among the
six scales with greatest overlapping with Pw,
as may be seen in Table 1. 

In one very true sense occupational scales 
may be ranked as to their effectiveness as 
shown in Table 1. The scales were designed 
to differentiate between an occupation and P 
and they do so to the extent shown in Table 1. 
But from another point of view the validity of 
the scales remains unknown despite the above 
information. It may be actually true that 
43 per cent of housewives and Pw overlap with
regard to the interests of housewives. If that
is so, the present scale is 100 per cent valid.
Overlapping can be used as a measure of valid- 
ity only when we bear in mind that the over- 
lapping which actually exists, not 0 per cent 
overlapping, must be viewed as the criterion of 
perfect validity and that greater or less over- 
lapping than the true amount represents in 
both cases a smaller degree of validity. 

The overlapping with Pm of 19 per cent on
the carpenter scale and 53 per cent on the
president scale indicates the extent to which
these two occupational groups differ from the
sample composing Pm. But we cannot say
that the carpenter scale is far more valid than
the president scale because we do not know to
what extent the two actually overlap with Pm.
As indicated above we do know that carpenters
should not overlap with Pm to the same degree
as presidents do.
Factors in the Return of Questionnaires Mailed to Older Persons


Joseph H. Britton and Jean Oppenheimer Britton 
Department of Psychology, The Pennsylvania State College 


To the social scientist dependent in his re- 
search upon voluntary participation, the ques- 
tion of the validity of such samples is of funda- 
mental importance; investigations dependent 
upon subjects responding by mail must prove 
or disprove the existence of any biases or 
selective factors operating within the universe. 

Reuss; Shuttleworth; Pace; Edgerton, Britt, 
and Norman; Franzen and Lazarsfeld; Reid; 
and Suchman and McCandless have conducted 
studies of the mailed questionnaire as a tool for 
research. The findings of these studies have
been reviewed elsewhere (5). Suffice it here 
to say that these studies were consistent in 
finding that respondents and non-respondents 
in mailed questionnaire surveys did not con- 
stitute homogeneous groups. Although the 
differences between them were frequently with 
respect to factors specific to the individual 
studies, it appears evident that respondents 
most generally have been those who are inter- 
ested in the topic under consideration; more- 
over, generally they have been those persons 
who have relatively strong ties of loyalty with 
the sponsor of the questionnaire inquiry. 

In this paper, the writers will present the 
findings of two research projects, the methods 
of which have been useful in discovering the 
nature of differences between respondents and 
non-respondents to questionnaires mailed to 
individuals in later maturity and old age. 


Procedure 


Each of the present writers has been con- 
cerned with the issue of sample biases in mailed 
questionnaire studies with older persons, one 
writer dealing with retired Y.M.C.A. secre- 
taries (1), the other dealing with a group of 
teachers retired from the Chicago public 
school (2,3). For each group a complete list 
of members was obtained, along with data on 
each of the members concerning age, date of re- 
tirement, etc. The writers have handled the 
problem of respondence and non-respondence
by comparing statistically several types of
participants and non-participants on various 
attributes. 

To 328 retired Y.M.C.A. secretaries was 
mailed a preliminary questionnaire dealing 
with problems of retirement. The schedule 
was returned by 256 secretaries (78 per cent). 
Sometime later a longer schedule, Your Activi-
ties and Attitudes, was mailed to the same 328.
Replies from the latter mailing numbered 165 
or 51.9 per cent. 

Then to each of those former secretaries who 
had not returned either questionnaire (i.e., 
those 57 who had cooperated in no phase of 
our research) was mailed a personal letter 
which asked him to answer three questions and 
to return the answers in a postage-paid envelope.
The questions were: (1) Are you
gainfully employed? (2) Do you feel you 
have fair economic security? (3) Do you 
think the Y.M.C.A. should operate some place- 
ment service for retired secretaries? Replies 
were received from 29 or a 50.8 per cent return. 

In general, the same procedure was followed 
in the study of the retired teachers, with certain 
exceptions. From a mailing of a preliminary 
questionnaire concerning retirement to 2,853 
retired teachers—305 men and 2,548 women— 
questionnaires were returned by 173 (56.7 per 
cent) of the men and 862 (33.8 per cent) of the 
women. Sometime later,² the longer schedule,
Your Activities and Attitudes, was mailed to a
selected portion of the total number. From 
this selected sample of 253 men and 1,249 
women, schedules were received from 120 (47.5 
per cent) of the men and 373 (29.9 per cent) of 
the women. 
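
The return percentages for the preliminary mailing to the retired teachers can be checked with a few lines of arithmetic. The sketch below is only an illustration: the helper name `return_rate` is hypothetical, and the counts are the preliminary-mailing figures quoted in the paragraph above.

```python
def return_rate(returned, mailed):
    """Percentage of mailed questionnaires that came back, to one decimal."""
    if mailed <= 0:
        raise ValueError("mailed must be a positive count")
    return round(100.0 * returned / mailed, 1)


# Preliminary questionnaire, retired teachers: (returned, mailed).
preliminary = {"men": (173, 305), "women": (862, 2548)}

for group, (returned, mailed) in preliminary.items():
    print(group, return_rate(returned, mailed))

# Combined rate for men and women together (derived here, not printed in the article).
total_returned = sum(r for r, _ in preliminary.values())
total_mailed = sum(m for _, m in preliminary.values())
print("all", return_rate(total_returned, total_mailed))
```

Keeping returned and mailed counts side by side makes it easy to recompute a rate for any subgroup before the groups are compared statistically.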

To secure additional data on non-respondents
(other than that supplied us on each retired
teacher in the study population) a 20 per
cent (with one exception) stratified quota
sample of those men and women not responding
to our mailing and who lived in the Chicago
area was drawn for interviewing. The inter-
viewing procedure was an attempt to secure
data comparable to that provided by respond-
ents in Your Activities and Attitudes and an ab-
breviated form of that schedule was used.³


1 Prepared by Ernest W. Burgess, Ruth S. Cavan,
and Robert J. Havighurst, and published by Science
Research Associates, 228 South Wabash Avenue, Chi-
cago 4, Illinois. See (4).

2 The Retired Teachers’ Association of Chicago mean-
while had published a report of the findings of the
preliminary study of retirement, and a portion of one
of its meetings was spent discussing the study.

Having obtained certain items of information 
on non-participants as well as participants we 
were able to compare statistically certain 
groups within our study populations. We 
wished to know if age, for example, were a 
factor in relation to whether or not former 
Y.M.C.A. secretaries or teachers responded to 
our questionnaires. To provide such a test 
we have employed Karl Pearson’s chi-square 
test of significance.⁴
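
The four-fold chi-square comparison described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' worksheet: the function name `chi_square_2x2` and the cell counts are hypothetical, and the continuity correction is simply switched on or off by a flag, since the article does not say by what rule the correction was judged necessary.

```python
import math


def chi_square_2x2(a, b, c, d, yates=True):
    """Chi-square (1 degree of freedom) for the four-fold table [[a, b], [c, d]].

    Returns (chi_square, upper_tail_probability).
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    diff = abs(a * d - b * c)
    if yates:
        # Yates's continuity correction: reduce |ad - bc| by n/2, never below zero.
        diff = max(diff - n / 2.0, 0.0)
    chi2 = n * diff ** 2 / (row1 * row2 * col1 * col2)
    # Upper-tail probability of a chi-square variate with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p


# Hypothetical counts, not taken from the study:
# rows = respondents / non-respondents,
# columns = retired under 60 / retired at 60 and over.
chi2, p = chi_square_2x2(40, 160, 25, 32)
print(f"chi-square = {chi2:.2f}, p = {p:.4f}")
```

A probability below .05 or .01 would be read as a significant difference between the two groups on the dichotomized attribute.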

Chi-square analyses of the data of the study 
of the Y.M.C.A. secretaries were made on the 
basis of several types of participants and non- 
participants. The Y.M.C.A. groups com- 
pared were these: 


(1). The R group, all men returning the 
short schedule, A Study of Retirement, 258 
retired Y.M.C.A. secretaries. 

(2). The L group, all men returning the long
schedule, Your Activities and Attitudes, 165
men. 

(3). The S group, the men returning only 
the short questionnaire, 106 men. 

(4). The X group, men who answered our 
personal letter asking for answers to three 
“vital” questions, 29 men. 

(5). The Z group, retired secretaries who 
have not cooperated in any phase of our re- 
search, 28 men. 


3 The problem of the availability of older persons for
interviewing is a related problem of importance. Our
data concerning interviewees are those gathered from
those individuals whom we were able to contact. With
the retired teachers, this was almost three-fifths (57.7
per cent) of the sample of men and women interviewees
(2, pp. 34-40). Robert J. Havighurst, in a study
of old age in a “typical” mid-western community,
concluded that “the groups with the most resistance to
being interviewed are upper class women and lower-
middle class women and men.” See (6, p. 166).

4 The standard method for four-fold tables was used,
with Yates’s correction when needed.


Where data were available, statistical com- 
parisons of these groups were made with the 
items listed in Table 1. 

The analysis of respondence and non-re-
spondence in the retired teachers’ group was
carried out similarly, but was a slightly more
complex process because the teachers’ group
included both men and women. Also, the
sample in the latter part of the investigation
was a stratified-random sample of the study
population, which made additional compari- 
sons desirable. The interviewees, moreover, 
were a particular segment of the non-respond- 
ing sample, and this made additional statistical 
comparisons necessary. Essentially, however, 
the procedure was the same for the two research 
projects. 

With the retired teachers’ groups, where ap- 
propriate, chi-square comparisons were made 
utilizing the following data: (1) sex, (2) pres- 
ent age, (3) date of retirement, (4) reason for 
retirement, (5) position before retirement, (6) 
nativity, (7) employment status, (8) marital 
status, (9) education, (10) health evaluation, 
(11) with whom living, (12) frequency of seeing 
family, (13) attending meetings, (14) attend- 
ing religious services, (15) listening to church 
services over radio, (16) feeling of economic 
security, and (17) a score of feelings of happi- 
ness. 

Thus in both of these investigations concerning
the adjustment of retired individuals
statistical comparisons have been made which 
make use of data on each person in the study 
populations. In the case of the study of 
Y.M.C.A. secretaries, supplementary data 
were secured from some non-respondents by 
personal letters; in the study of retired teachers, 
supplementary information was obtained from 
some non-respondents by means of interviews. 


Results 


Table 1 shows the results of the statistical 
comparisons, using Pearson’s chi-square test 
of significance, of the various groups of re- 
spondent and non-respondent retired Y.M.C.A. 
secretaries. Since the procedure was the same 
with the retired teachers, the data from that 
investigation will only be summarized here. 

Table 1

Probability Values for Chi-Squares from 2 × 2 Contingency Tables Comparing Groups of Respondent and
Non-Respondent Retired Y.M.C.A. Secretaries on Certain Attributes

Attribute                                                         X vs. Z   S vs. X+Z, R vs. X+Z, L vs. X+Z

Present age (under 70; 70 and over)                               .30       .90       .90
Date of retirement (before 1935; 1935 and after)                  .95       .30       .90
                                                                            (60+)     (60+)
Age at retirement (under 60; 60 and over)                         .95       .01**     .01**
                                                                            (25+)     (25+)
Number of years’ service (under 25 years; 25 years and over)      .90       .01**     .01**
                                                                                      (Gen’l)
Branch of “Y” work (Gen’l Sec’y; other)                           .50       .50       .05*
Population of town of residence (under 10,000; 10,000 and over)   .10       .20       .50       .90
                                                                            (moved)   (moved)   (moved)
Geographical mobility (moved; not moved)                          .90       .05*      .01**     .01**

Note: Probability values which indicate statistically significant differences are marked with “*” if significant
at the 5 or 2 per cent levels, and “**” if significant at the 1 or less than 1 per cent level. For each significant
difference, the direction of that difference is noted in parentheses above the significant probability value. The
cell category having the greater proportion among respondents is thus shown. For example, when S vs. X+Z
are compared on age at retirement, the probability value is significant; the “(60+)” above the “.01**” indicates
that the respondents (S) included a significantly large number of men who retired at 60 and over.


As shown in Table 1, the two groups of non-
respondents, X and Z, are homogeneous.⁵ In
our comparisons, then, these patterns appear 
clear: Those Y.M.C.A. secretaries cooperating 
in our research by returning questionnaires, 
either short or long ones or both, are men who 
have, for the greater part, retired at 60 years of
age or older and who have served the Y.M.C.A.
for 25 years or more. One other difference
appears evident, viz., that those persons co-
operating to the fullest extent, the R group
(which contains most of the L group), include
more former secretaries who have moved since
retirement than one would expect by the laws
of chance.

With the investigation of the retired school
teachers the procedure, as outlined above, was
similar to that of the study of the former Y.M.C.A.
secretaries. The results of computations to
discover sample biases will only be summarized.
In both our preliminary and current investi-
gations of retired teachers, the men responded
better than the women. Chi-square compari-
sons of responding and non-responding men
showed one significant bias, viz., that the
respondents tended to be those who had held
administrative or college teaching positions
before retirement rather than those who had
held elementary or high school positions.
With the women, a similar situation existed
(except that the division had been made with
elementary teachers versus other); in addition,
the responding women tended to be those who
were under 75 years of age, hence those who
had retired recently, and those who were
living outside Chicago in communities under
10,000 population. Information from women
interviewees revealed some indication that re-
sponding women tended to be native-born in-
dividuals who were in good health and who
were economically secure.


5 Additional statistical comparisons of S versus Z,
R versus Z, and L versus Z bear this out; since X and Z
are similar, those additional comparisons have been
omitted from the table.


Summary 


Of importance in any study dependent upon 
volunteer subjects—and especially significant 
in a study involving mailed questionnaires in 
an effort to assess the adjustment of older per- 
sons—is the determination of the existence or 
non-existence of selective factors in respond- 
ence and non-respondence. We have handled 
this problem by comparing participants and 
non-participants on various attributes, utilizing
data available on the total study population
as well as that secured by letters and interviews 
of non-respondents. Our investigation has re- 
sulted in suggestions of factors influencing 
participation in mailed questionnaire studies in 
the field of old age. There may well be dyna- 
mic factors of motivation, such as psychological 
satisfactions from positions, personal identi- 
fications, and emotional security, as well as 
values and attitudes concerning socioeconomic 
status, which are related to these objective 
data. Additional methodological studies cop- 
ing with the problem of the representativeness 
of responding groups in studies with other older 
persons are needed. There is immediate need, 
too, for information on the efficacy of certain 
types of approaches, types of letters, and 
sponsorship of studies, etc., in securing full 
participation of older adults in research proj- 
ects. Clinical material dealing with analyses 
of the feelings older individuals reveal for par- 
ticipating or not participating would be useful 
in planning research. It is of the utmost im-
portance for investigators not only to learn
how the highest percentages of participants can be
obtained but also to account for any existent
sample biases.


Received March 9, 1950. 
Reliability of Personal Interview Data


Charles L. Vaughn 
The Psychological Corporation


and 


William A. Reynolds 
Batten, Barton, Durstine & Osborn, Inc.*


Audience studies and similar surveys are 
often reported in terms of breakdowns by age, 
education, and socio-economic level of re- 
spondents. The meaningfulness of the results 
obviously depends upon the reliability of these 
measures. The present analyses were made to 
ascertain the reliability of reports of age, 
education, and socio-economic level as ob- 
tained under certain rather common conditions 
in personal interview surveys. Other investi-
gators have previously reported¹,² results of
similar studies. 


The Samples Interviewed 


Original and repeat interviews were made 
about three months apart with each of two 
groups of adults. The first group was com- 
posed of 888 adults in Des Moines, Iowa; the 
second group was composed of 430 adults in 
Springfield, Massachusetts. The original in- 
terviews were made in each city in April, 1948; 
the repeat interviews were made with the same 
groups in July-August, 1948. 

The names of individuals to be interviewed 
in April were drawn at random from recent city 
directories of all adults (approximately 20 
years of age and older) in the respective cities.³


*Mr. Reynolds was Research Associate, National 
Broadcasting Company, when the study was conducted. 
The work herein reported represents special analyses of 
data secured for other purposes in a study conducted 
by the Marketing and Social Research Division of The 
Psychological Corporation for National Broadcasting 
Company and Columbia Broadcasting System.

1 Cantril, Hadley, and Research Associates in the 
Office of Public Opinion Research. Gauging public 
opinion. Princeton, N. J.: Princeton University Press, 
1944, pp. 98-106 (Ch. VII, by Frederick Mosteller). 

2 Campbell, Angus. Attitude stability and change;
a re-interview study of the national population. Amer. 
Psychologist, 1948, 3, 272 (abstract of paper read before 
American Psychological Association). 

3 When persons had moved, however, adults living in
the same dwelling unit were listed, and one person was 
selected at random from among those individuals. 
Substitutions, of course, were permitted only in the 
April samples. 


In other words, the sample of individuals was 
pre-designated. Several call-backs were made 
to reach as many as possible of the individuals 
originally designated for interview. Actually, 
1,034 individuals were interviewed in Des 
Moines in April, and 527 in Springfield; but 
there was shrinkage in the samples from April 
to July due to refusals, inaccessibility, or in- 
completeness of information. Several call- 
backs were made, however, to assure as full a 
recovery as possible. 


Place of Interview 


All interviews were made personally in the 
home, except in an occasional case when the 
interviewer was advised by a member of the 
household to go to a place of business or else- 
where to reach the respondent expeditiously. 


The Interviewers 


Reliability information on measures of this 
sort may, of course, be heavily influenced by 
the interviewers. The interviewers were, for 
the most part, upper classmen from the col- 
leges located in the respective cities—Drake 
University in Des Moines, and Springfield Col- 
lege in Springfield—recruited by the Research 
Associates of The Psychological Corporation 
at the two institutions. Most of the inter- 
viewers had not had previous experience in 
survey work. They were trained specifically 
for and supervised on these surveys by staff 
members from the New York office of The 
Psychological Corporation. Cheating and care- 
less work were adequately prevented by follow- 
up calls, inspection of each day’s work, etc., 
by the supervisors. 

In general, the same set of interviewers did 
not conduct both the April and July-August 
studies. Nor did the July-August inter- 
viewers have a record of the information obtained
in April. A total of 26 interviewers
participated in the April Des Moines study; 
25 in the July-August study. Of these, five 
interviewers worked on both the April and 
July-August studies. In Springfield, 17 inter- 
viewers worked in April and 11 in July-August. 
There was no overlap in interviewers in Spring- 
field between April and July-August. 

Different supervisors worked in Des Moines 
and Springfield. The same supervisors, how- 
ever, handled the operation in Des Moines 
both in April and again in July-August; this 
was also true for Springfield. 


Instructions to Interviewers 


Age. The following item appeared toward 
the end of the questionnaire blanks: 


“Approximate age of person interviewed:
Below 20__; 20-29__; 30-39__; 40-49__;
50-59__; 60 and over__.”


The position of the item was that usually re-
served on consumer questionnaires for classification
data and followed the basic questioning,
which took about 15 minutes.

No special emphasis was placed upon the age 
item as such. Interviewers were told, “You 
may estimate the person’s age, but it is better 
to ask. If properly approached, few people 
will hesitate to give their ages.” 

Educational Status. The following question 
appeared toward the end of the questionnaire 
blank: 


“Would you mind telling me the last grade 
you completed in school? 8th or below__; 
9-10__; completed high school__; some college__;
completed college__; D.K.__.”


No special instructions were given on the
question. 

Socio-Economic Ratings. In many of the 
studies conducted by The Psychological Cor- 
poration and other agencies, the following 
socio-economic levels are designated: 


Socio-Economic Group    Per Cent of Households

A                       Top 10 per cent
B                       Next 30 per cent
C                       Next 40 per cent
D                       Bottom 20 per cent





Interviewers in the present study were di- 
rected to rate respondents in terms of these 
four categories. In this connection, it is im- 
portant to note that practically all interviews 
were conducted in the home. Interviewers 
were carefully instructed to distinguish be- 
tween different socio-economic levels in terms 
of home ownership, general appearance of the 
home, its size, possessions of electric refrigera- 
tors and automobiles, manner of speech of the 
respondents, and so on. Socio-economic maps 
already available for the respective cities were 
used to guide the interviewers as to what level 
homes they might generally expect to find in 
the various sections of the towns. 

There is one important difference between 
the manner of obtaining socio-economic ratings 
in this study and that followed in the more 
common quota control surveys. In this study, 
individual respondents were pre-designated, 
and the original socio-economic ratings were 
made post hoc, so to speak. In quota control 
studies, interviewers are instructed to go out 
and get so many “A’s,” “B’s,” and so on; in the
more tightly controlled quota samples, inter- 
viewers are assigned to specific areas where 
they may expect to find the respective groups. 
When interviewers are free to choose respond- 
ents within quotas, the original-repeat reli- 
abilities may be quite different from what they 
are when the original choice of respondent is 
pre-determined for the interviewer. 


Results 


Table 1 shows the product-moment correla-
tions between original (April) and repeat re-
ports (July-August) of age, education, and
socio-economic status for Des Moines and
Springfield. The N’s (sample size) vary some-
what due to the exclusion of a case from the
correlations when there was a “don’t know”
or “no answer” for the individual in either
April or July-August. This procedure should
not materially affect the sizes of the coefficients.


Table 1

Product-Moment Correlations between Original (April)
and Repeat Measures (July-August): Age,
Education, Socio-Economic Level

                         Des Moines       Springfield
                          r      N            r
Age                      .85    856          .80
Education                .82    861          .67
Socio-Economic Level     .61    835          .42

r = product-moment correlation coefficient; N =
number of cases upon which r is based.


Table 2

Product-Moment Correlations between Original and
Repeat Measures (Cantril): Age,
Economic Status
Note: Data from Cities Over 100,000.

                          r
Age                      .91
Economic Status          .63
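
The original-repeat correlation described here, with incomplete cases dropped, can be sketched as follows. This is an illustrative computation only: the function names and the coded responses are hypothetical, and the numeric coding of the answer categories (1 for the lowest bracket upward) is an assumption, since the article does not state the scoring used.

```python
import math


def complete_pairs(original, repeat):
    """Keep only respondents with a usable report in both waves.

    A missing report ("don't know" or "no answer") is represented by None.
    """
    return [(x, y) for x, y in zip(original, repeat)
            if x is not None and y is not None]


def pearson_r(pairs):
    """Product-moment correlation for a list of (original, repeat) scores."""
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical age-bracket codes (1 = below 20, ..., 6 = 60 and over).
april = [2, 3, 3, 4, 5, 6, None, 2]
july = [2, 3, 4, 4, 5, 6, 3, None]

pairs = complete_pairs(april, july)
r = pearson_r(pairs)
print(f"N = {len(pairs)}, r = {r:.2f}")
```

Because a case is dropped whenever either wave is missing, N can differ from one measure to the next, as it does in the reported table.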

These correlations would, however, be raised 
somewhat by applying Sheppard’s correction 
for coarse groupings. This statement is par- 
ticularly pertinent to the correlations with re- 
spect to socio-economic groups, where there 
are only four coarse groupings. 
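
The effect of Sheppard's correction on a correlation can be sketched as follows. The correction subtracts h²/12 from each variance, where h is the width of the grouping interval, and leaves the covariance unchanged, so the coefficient rises. The function name and the standard deviations below are hypothetical (the article reports none), and treating the four socio-economic groups as equal-width intervals of a roughly normal variable is an assumption made only for illustration.

```python
import math


def sheppard_corrected_r(r, sd_x, sd_y, width_x=1.0, width_y=1.0):
    """Correlation after applying Sheppard's correction to both variances.

    sd_x and sd_y are the standard deviations of the grouped scores;
    width_x and width_y are the grouping-interval widths in the same units.
    """
    var_x = sd_x ** 2 - width_x ** 2 / 12.0
    var_y = sd_y ** 2 - width_y ** 2 / 12.0
    if var_x <= 0 or var_y <= 0:
        raise ValueError("corrected variance must be positive")
    # The covariance (r * sd_x * sd_y) is not affected by the correction.
    return r * sd_x * sd_y / math.sqrt(var_x * var_y)


# Hypothetical standard deviations of 0.9 category-widths in each wave.
corrected = sheppard_corrected_r(0.61, 0.9, 0.9)
print(f"uncorrected r = 0.61, corrected r = {corrected:.2f}")
```

The fewer and coarser the categories relative to the spread of the scores, the larger the upward adjustment.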

The Office of Public Opinion Research’s 
results for age and socio-economic level are 
shown in Table 2. The figures appear in 
Mosteller’s chapter in Cantril.⁴ Campbell’s⁵
figures are not available for comparison. The 
results in Table 2 were obtained by different 
interviewers making calls about two months 
apart on the same respondents. The original
sample was nation-wide in scope, the first wave 
of interviewers being assigned by the usual 
quota control methods, according to city-size, 
state, economic status, and sex. The first 
wave of interviewers was instructed to get 
names and addresses when possible. A panel 
was thus formed, and in cities of over 100,000 
new interviewers were sent out to report on 
the panel about two months later. 

The Des Moines-Springfield results are quite 
similar to those of the Office of Public Opinion 
Research, despite the differences in methods. 

4 Op. cit. Results are also reported for call-backs by
the same interviewers but are not reproduced here,
since the methods were quite different.

5 Op. cit.


Summary 


1. Rather common procedures in personal 
interview surveys yield satisfactorily reliable 
reports of respondent ages. The two studies 
briefly described herein gave product-moment 
correlations for the age variable of .85 and .80, 
respectively, between original and repeat inter- 
views about three months apart. The Office 
of Public Opinion Research reported a correla- 
tion of .91 between original and repeat inter- 
views about two months apart. The methods 
employed in the present studies and the Office 
of Public Opinion Research’s differed some- 
what. Even so, the correlations are remark- 
ably similar. 

2. Reports of education appear to be some- 
what less reliable than are those on age, but are 
satisfactory. Original vs. repeat product- 
moment correlations obtained in the two studies
reported herein were .82 and .67, respectively.
The Office of Public Opinion Research
reported no results on education. The dis- 
crepancy between the correlations of .82 and 
.67 can be explained in various ways, but no 
evidence is available to justify one possible 
explanation over the others. It should be 
noted that these correlations are reliability 
coefficients; they do not necessarily reflect the 
validity of reports of education. 

3. Interviewer ratings of socio-economic 
level are definitely less reliable than are those 
for age and education. Original vs. repeat 
product-moment correlations obtained in the 
present studies were .61 and .42, respectively; 
the Office of Public Opinion Research reported 
a correlation of .63. Here again no evidence 
is available to justify one explanation over the 
others for the difference between the correla- 
tion of .42 and the other two (.61 and .63). 
These reliabilities are adequate, however, for 
demonstrating gross differences in attitudes, 
buying behavior, etc., as between different 
socio-economic groups. 

On the other hand, improving the measures 
in such a way as to raise their reliabilities 
may accent those differences, or reveal differ- 
ences which have previously been obscured. 


Received February 20, 1950. 
Readability of Typography in Psychological Journals


Robert S. Soar 
University of Minnesota 


The lag between the publication and the ap- 
plication of research findings is well known. 
It might be hoped, however, that the scientist 
would apply the findings within his own disci- 
pline rather promptly. Specifically, psychol- 
ogists have done research on the readability of 
type and have discovered what typography 
arrangements are read most efficiently. One 
might expect that these results would soon find 
application in the printing of psychological 
journals. 

Paterson and Tinker (6) have presented a 
manual for typographers which includes a 
critical review of the literature on typography 
up to 1940, and the results of extensive re- 
searches of their own into the influence of typo- 
graphy on speed of reading. Comprehensive 
summaries of these data have been presented 
for use in typography for newspaper advertis- 
ing by Anthony (1) and for evaluation of read- 
ability by Burtt (2). Paterson and Tinker 
have extended these researches (7, 9) and fur- 
ther demonstrated that, in general, those 
printing practices that are read most rapidly 
are rated by readers as the most legible and the 
most pleasing in appearance. 

Luckiesh and Moss (3, 4) have also worked 
in this area since the 1940 review of Paterson 
and Tinker, but their only control of the speed
of their subjects’ reading was to ask them to 
read at their “normal” rate. As a measure 
of readability they employed blink rate, which 
Tinker (10) has demonstrated to be unrelated 
to speed of reading. These aspects of their 
methodology make their findings difficult to 
evaluate, and as a consequence they were not 
utilized in this study. 


The Present Study 


The present study is an investigation of the 
degree to which the printing practices of a group 
of eighteen psychological journals conform with
what was found by research to be optimal. The
criteria of legibility, defined in terms of speed
of reading, reader preferences, and economy of
presentation, as set forth by Paterson and
Tinker (6, 7, 9), were employed. This technique
of comparison has already been used by
Nelson (5). 

The first issue of volume 1, or the 1920 
volume, whichever was later, and the first 
issue of 1950 were examined for each of the 
journals. Since all commonly used type styles 
(with the exception of Old English and Ameri- 
can Typewriter) are approximately equally 
legible (6, pp. 13-28), this aspect of typog- 
raphy was not recorded. 

The aspects of typography examined as 
well as the results obtained are presented in 
Table 1. 

In addition to the results presented in 
Table 1 several other aspects of typography 
were examined. Since readers prefer a type 
style bordering on bold face (6, pp. 26-7) and 
dull finished paper stock (6, pp. 134-5), both 
of these variables were rated subjectively. 
Since these ratings were found to be of ques- 
tionable reliability, they were discarded. 

The data on type form presented in Table 1 
indicate that all the journals followed optimal 
practices in the use of italics and bold face. 
Widespread and in some cases increasing use is 
made of all capital type, however. It should
be noted here that some of the data are in- 
complete. A number of the earlier issues were 
available only in bound volumes in which the 
covers had been discarded. Where title pages 
were available, the practices used there were 
examined, subject to the assumption that they 
were the same as had been used on the cover. 
But even this was not available in all cases. 
The data are also incomplete for table headings 
in that one journal, the Psychological Abstracts, 
does not present tables. 

The use of all capital type has been found to
retard reading speed in continuous text by
almost 12 per cent (6, pp. 22-3). This was
one of the larger decrements found among all
the nonoptimal arrangements studied. It was
also found that most readers (90 per cent) pre-
fer lower case over all capital type (6, pp.
24-5). It might seem that the use of all
capital print in situations other than body type
would be a matter of little importance to the
reader. Yet these are situations in which the
intent would be to attract the reader’s atten-
tion, and to enable him to read at a glance.
It would seem incongruous to use a non-pre-
ferred and slowly-read type form in such cases.
In fact, Paterson and Tinker (8) found that
newspaper headlines are read significantly
slower in all capitals than in caps and lower
case, i.e., by 5-18 per cent. Nevertheless
psychologists, as shown in Table 1, fail to apply
these findings in printing their journals. A
lag of 5 to 10 years between research findings
and application of the findings apparently is
too short!


Table 1
Summary of Printing Practices of 18 Psychological Journals

Typography Practice

Type form
Italics excessive
Use of capitals
Journal title
Journal cover information
Article title
Page headings
Subheadings
Table headings
Excessive use of bold face
Bold face for headings and covers
Type size, line width, and leading
Spatial arrangement—double column
Cover and cover print color

1920 or

* Cases in parentheses questionable.

It can be argued that legibility of cover 
pages, article title, page headings, and sub- 
headings is unimportant because these are 
merely facade trimmings. But this argument 
does not hold for table headings because surely 
the editor desires readers to grasp and grasp 
quickly the content of a table. Generally 
speaking nonoptimal typography in the “un- 
essentials” is also accompanied by nonoptimal 
typography in the “essentials,” such as table 
headings. 


One journal, the Journal of Consulting Psy-
chology, was found to use all capital type in all
the categories examined. Three others used
all capital type in all but one of the categories 
examined. This category in each case was 
journal cover information, in which a combina- 
tion of all capital and capital and lower case 
type was used. These three journals were 
Journal of Social Psychology, American Journal 
of Psychology, and Journal of General Psy- 
chology. 

The aspect of typography of most impor- 
tance to the reader is the legibility of body type. 
In this study no nonoptimal combinations of 
type size, line width, and leading were found. 

With respect to spatial arrangement of the 
printed page, it may be seen in Table 1 that 
double column print is being used more widely 
by 1950. Whether this is a consequence of 
reader preferences being followed or of the 
pressure to print more on the page is in doubt, 
but the result is desirable in any case. 

For cover and cover print color, some of the 
data are missing because the covers were re- 
moved in binding. In a number of cases the 
classification of these combinations as optimal 
or nonoptimal is subjective to a considerable 
degree. For a few journals no decision could 
be made. The research findings (6, pp. 118-
129) have indicated brightness contrast to be
one of the more important variables determin- 
ing the legibility of various combinations of 
print and background colors, and since not all 
combinations have been investigated experi- 
mentally an area of doubt exists between the 
definitely optimal and the definitely nonopti- 
mal. With one exception, all the cases indi- 
cated as questionable in the table were com- 
binations of intermediate contrast. The ex- 
ception was one of white print on a blue back- 
ground. The contrast here would seem quite 
acceptable, but the research findings (6, p. 
113) have indicated white on black, a some- 
what similar case, to be nonoptimal. Paren- 
thetically, a new journal, the Psychological 
Service Center Journal, not included in the 
study, presents a black cover with silver type 
which provides a new low in legibility! In 
this case and in several others, the legibility 
of cover page printing is highly important 
because of the fact that the articles and page 
numbers are listed. Among the more clearly 
nonoptimal in this study were the combina- 
tions of black print on a brown cover, used by 
Comparative Psychology Monographs; light red 
on light gray, Psychological Monographs; and 
black on dark red used by the Journal of Gen- 
eral Psychology. Other combinations, such as 
black on blue, gray, or even light red would 
also seem to be nonoptimal. 

In addition to the variables indicated in 
Table 1, the per cent of the total page devoted 
to print was computed. For the 1920 or 
volume one issues the median per cent printed 
was 51.5, the range 44-69; for the 1950 issues 
the median was 60, the range 52-70. This 
trend toward larger percentages in the 1950 
issues is found despite a bias in the data which 
would tend to produce a larger proportion of 
the page devoted to print in the earlier issues. 
Eleven of the early issues were available only 
as bound volumes, seven unbound; whereas 
the later issues were all unbound. Presum- 
ably some of the margin would have been lost 
from trimming and stitching in the binding 
process, resulting in an increased per cent of the 
page devoted to print. Thus more than half 
of the early issues showed a higher proportion 
of printed area than was really the case prior 
to binding. 

Four journals were found to utilize less than 
55 per cent of the page in print in the issues 
measured. They were the Journal of Social 
Psychology, the American Journal of Psy- 
chology, Journal of General Psychology, and the 
Journal of Psychology. 

Paterson and Tinker (6, pp. 86-92) cite 
evidence that the average reader is subject to 
a “part-whole” illusion in estimating the pro- 
portion of a page that is devoted to print. He 
estimates it to be from 18 to 25 per cent greater 
than it really is. Further evidence is presented 
(6, pp. 97-8) showing that margins are not 
necessary to legibility at all, and are to be 
justified only on esthetic grounds. 

In order to get an idea of the practical re- 
sults of extending the printed area to a larger 
proportion of the page a series of word counts 
were made. Two journals were found that 
were exactly matched, within the accuracy of 
the measurement, for total page area. One 
had 52 per cent of the page in print as against 
65 per cent for the other. The first used 10 
point type with two point leading and a 27 
pica line. The other used a two column ar- 
rangement printed in 10 point type with one 
point leading and a 14 pica line. Both are 
equally legible, but the lesser leading would of 
course contribute to the number of words that 
were printed on a page. A random sample of 
10 pages of solid body type was taken from 
each, and the number of words counted. The 
first had a mean of 412 words per page com- 
pared to 580 for the other; t=3.47, P<.01, 
showing that the difference is statistically sig- 
nificant. The first, then, presented only about 
71 per cent as many words per page as the 
second. 
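The comparison of the two matched journals can be checked with a few lines of arithmetic. The sketch below uses only the two reported means (412 and 580 words per page); the variable names are illustrative, and the t test itself is not recomputed because the individual page counts are not given.

```python
# Mean words per page reported for the two journals with matched page area.
words_first = 412    # 52 per cent of page in print, single 27 pica line
words_second = 580   # 65 per cent of page in print, two 14 pica columns

# Share of the second journal's words that the first journal presents.
ratio = words_first / words_second

# Additional words carried by each page of the second journal.
extra_per_page = words_second - words_first

print(f"first / second = {ratio:.1%}")
print(f"extra words per page = {extra_per_page}")
```

The ratio comes to roughly 71 per cent, in agreement with the figure stated in the text.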

When a considerable lag exists between the 
acceptance and the publication of articles, in- 
crease of the page area devoted to print would 
seem to be a partial solution for the journal 
that currently uses little more than half of 
each page. Use of a two-column arrangement 
and narrower margins produces, therefore, 
many more words per page without loss in 
legibility. 

Of all the journals examined, only one used 
optimal practices in all the categories examined. 
Excessive use was not made of italics or bold 
face, but bold face was used for headings and 
titles; no all-capital type was used; a two col- 
umn arrangement was employed, using 10 
point type with one point leading in a 16 pica 
line. A page ten by six and five-eighths inches 
was used, and 68 per cent of it was covered by 
the print. The cover employs black print on 
a yellow jacket. This is the Journal of Ap- 
plied Psychology. 

Among the other journals there was con- 
siderable variation in the printing practices 
that were set up in nonoptimal fashion. No 
single journal, however, was outstanding in 
using more nonoptimal practices than numbers 
of others. To select a single journal as using 
the most undesirable arrangements would have 
required a judgment as to which of these prac- 
tices should be weighted most heavily. This 
was not done. Rather, the procedure was 
followed of selecting several journals in each of 
the areas in which nonoptimal practices were 
most widespread. But although the several 
journals identified in each area were to a de- 
gree extreme, they were also to a considerable 
degree typical. There were others, unnamed, 
that were little different in each case. 


Summary and Conclusions 


Eighteen psychological journals were ex- 
amined for conformance with printing prac- 
tices demonstrated to be optimal. Either 
volume one or the 1920 volume, whichever 
was later, and the 1950 volume were examined 
for each journal. 

The results warrant the following con- 
clusions: 


1. In general, the practices used in present- 
ing body type have been and are optimal. 

2. In other cases, such as type form in titles 
and headings, and cover and cover print color 
combinations, nonoptimal practices have been 
widespread or have even increased. This is 
particularly true with respect to use of all- 
capital printing. 

3. At least some of the optimal practices 
which have increased, such as using double 
column print and smaller margins, might be 
attributed to the need for printing more on 
every page. 

4. In general, there is little evidence that 
available research findings are being applied 
although this application would result in more 
legible printing. 


Received August 8, 1950. 


Flesch’s “Measuring the Level of Abstraction” 


James J. Jenkins and Robert L. Jones 
University of Minnesota 


Flesch’s recent article, “Measuring the Level 
of Abstraction,”¹ presents a sound empirical 
approach to the evaluation of abstract words 
in a piece of writing. He has clearly pointed 
out that this is a live and important topic. We 
feel, however, that his application of the find- 
ings to the measurement of reading ease may 
be somewhat misleading to the reader who be- 
lieves that this formula is necessarily better 
than the one previously presented in this 
journal.² In the opinion of the writers this 
is not at all the case. 

For both formulas Flesch sets up Beta 
weights on the C50 criterion and then transfers 
to the C75 criterion. The multiple correlation 
coefficients given are assumed to be those of the 
C50 criterion. One must ask, what is the 
actual C75 correlation? To what extent is 
this criterion really predictable? 


The correlation given in the latest article 
for syllable count alone is .69. The multiple 
correlation (adding the measure of abstraction 
as a predictor) is only .72. Is this .03 gain of 
either statistical or practical significance? 
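One way to put the .03 gain in perspective is to compare the share of criterion variance accounted for by each coefficient. The sketch below is an illustration only: squaring the coefficients is an assumption about how the gain might be judged, not a computation reported in either paper, and the variable names are invented here.

```python
# Correlations with the criterion, as quoted in the text.
r_syllables = 0.69   # syllable count alone
r_multiple = 0.72    # syllable count plus the abstraction measure

# Proportion of criterion variance accounted for by each predictor set.
var_single = r_syllables ** 2
var_multiple = r_multiple ** 2

# Additional variance attributable to the abstraction measure.
gain = var_multiple - var_single

print(f"variance explained, syllables only: {var_single:.4f}")
print(f"variance explained, both predictors: {var_multiple:.4f}")
print(f"additional variance explained: {gain:.4f}")
```

On this reading the second predictor adds a little over four per cent of the criterion variance.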

Flesch states that the correlation of “definite 
words” (.55) is higher than that found for 
“average sentence length” (.52) in the earlier 
paper. This statement would appear to lack 
justification since the sample of reading ma- 
terial used in the latest paper is different from 
that used before. It might be noted that the 
correlation for “word length” increased with 
this difference in material from .66 to .69, the 
same difference of .03. 

In the earlier article Flesch pointed out that 
sentence length and word length are measures 
of abstraction and concluded, “Formula A, 
therefore, is essentially a test of the level of 
abstraction.” One wonders to what extent 
Formula A and the latest measure of abstrac- 
tion are correlated. If this correlation were 
high, the earlier formula might provide a con- 
venient shortcut to the more recent measure. 

1 Flesch, R. Measuring the level of abstraction. J. 
appl. Psychol., 1950, 34, 384- 

2 Flesch, R. A new readability yardstick. J. appl. 
Psychol., 1948, 32, 221-233. 

Finally, the procedure for obtaining the 
count of definite words seems to us to be much 
too complex for general use. The previous 
formula has been demonstrated to be highly 
reliable,³ but the consistency and accuracy 
task imposed on an analyst by the new formula 
would seem quite formidable. Obtaining the 
count requires sixteen substeps with thirteen 
limitations or qualifications. The use of the 
count requires a knowledge of such recondite 
matters as natural and common gender nouns, 
indefinite pronouns, finite verb forms, auxili- 
ary verbs, “to be” used as copula, present 
participles used as part of the progressive 
tense, “which,” “that,” and “the” in special 
cases but not in others, etc. The assumption 
that a user of the formula can make these dis- 
criminations is not justified. As Lorge* points 
out, “. . . the number of prepositional phrases 
is likely to be inexactly counted because many 
teachers and research workers do not know 
what a prepositional phrase is.” And as a 
readability formula this is (at best) only .02 
correlation points better than the formula 
using sentence length, which is a very simple 
variable to compute. 

Essentially our feeling is that if one is inter- 
ested in abstraction per se, this new measure- 
ment may be of definite value. But as a 
measure of readability we feel that this offers 
very little to the worker in the field. Existing 
instruments would seem to be as good or 
better. 


Flesch’s “Measuring the Level of Abstraction” 


Reply to Criticism by Jenkins and Jones 


Rudolf Flesch 
Dobbs Ferry, New York 


Jenkins’ and Jones’ critique is welcome and 
valuable. They are right in emphasizing that 
the new formula is not necessarily better as a 
measure of readability than the earlier one. 
It is primarily a measure of abstraction level, 
and only secondarily a measure of readability. 
At any rate, like any other test, it will have to 
prove its value in the field. 

As to Jenkins’ and Jones’ specific comments, 
the point raised about the C50 and C75 criteria 
seems irrelevant here. The C50 criterion, 
having a smaller standard deviation, yielded 
higher prediction values; the C75 criterion was 
substituted by earlier investigators because it 
gave a better indication of grade levels. 

The correlation of average sentence length 
to the narrower criterion used in the latest 
paper was .52, the same as the correlation to 
the wider criterion reported in the earlier 
paper. This fact was not reported because of 
an oversight. 

The correlation between the count of definite 
words and word length in syllables is —.57, 
that between the count of definite words and 
average sentence length —.49. On this basis, 
the earlier formula cannot be considered a con- 
venient shortcut to the more recent measure. 

Aside from these relatively minor points, 
Jenkins’ and Jones’ main criticism is that the 
new formula is much too complex for general 
use, considering the small statistical increase 
in its correlation with the criterion. 

It is obviously true that the count of definite 
words is a rather complex task. However, in 
comparing it with the earlier formula, it should 
be remembered that it is offered as a substitute 
for Formula B rather than Formula A. As 
shown in the paper by Hayes, Jenkins, and 
Walker, cited by Jenkins and Jones, the 
analyst-to-analyst reliability of Formula B is 
somewhat lower than that of Formula A. 

To evaluate the new formula solely on the 
basis of its numerical correlation with the cri- 
terion is to take too narrow a view of readabil- 
ity measurement. Jenkins and Jones seem to 
consider readability as a readily defined, fixed 
quality, so that the problem of measurement 
becomes simply the problem of finding the most 
easily applicable yardstick. However, read- 
ability surely can not be defined in terms of 
the criterion used—that is, the grade level of 
children who could answer correctly 3/4 of the 
test questions appended to McCall-Crabbs’ 
test lessons. Rather, readability is a complex 
quality of written prose that relates to a variety 
of factors on the part of readers on different 
levels of age and education—for example, 
comprehension, readership, reading speed, 
recall, and attitude changes. In other words, 
in its wider context, a readability measurement 
formula should be considered as a diagnostic 
and clinical tool in the pathology of communi- 
cation. Viewed in that way, the complexity 
of the new formula will have to be balanced 
against its clinical value. It is possible, for 
instance, that the count of definite words will 
give a better prediction of readership or recall 
than any other readability measures so far 
proposed. If so, the clinical value may 
justify the complexity of the tool just as, in 
another field of psychology, the clinical value 
has justified the complexity of the Rorschach 
Test. 


Received November 22, 1950. 


The Role of “Cutting” in the Perception of the Motion Picture * 


Herman D. Goldberg 
Department of Psychology, Hofstra College 


Parts of a stimulus are perceived as belong- 
ing to a whole. The way in which the whole 
is perceived will influence the meaning of the 
parts. In other words, the meaning derived 
from a given part of a whole is dependent upon 
the surrounding parts (1). 

Can exactly the same film scene placed in a 
different context produce a different perceptual 
meaning for the people viewing it? This ex- 
periment was conducted in an effort to show 
that it is the splicing of the film which pro- 
duces the desired perceptions rather than the 
photography per se. In many cases the mo- 
tion picture maker cuts and splices together 
unrelated material to achieve the desired per- 
ceptions. Individual scenes are sometimes 
made for different purposes, at different times, 
and at different places but are combined in a 
specific order so as to give the viewer an entirely 
new concept that could not be derived from 
seeing the scenes separately or in any other 
order. 

Hypothesis 

Individuals will react and describe differently 
the same motion picture scene shown in differ- 
ent context. 

Materials 

A motion picture projector, screen, two short 
films, and a questionnaire. 


Film A consisted of four scenes spliced 
together: 

Scene 1—Boy riding tricycle. 

Scene 2—Foot depressing auto brake 
pedal. 

Scene 3—Auto wheel—not in motion. 

Scene 4—Woman screaming. 

Film B also consisted of four scenes spliced 
together: 

Scene 1—Boy riding tricycle (same as 
Scene 1, Film A). 

Scene 2—Boy stopping tricycle, getting 
off and placing toy lamb on 
head. 

Scene 3—Man laughing. 


* Assisted by S. E. Perlman. 


Scene 4—Woman screaming (same as 
Scene 4, Film A). 


It is important to note that scenes 1 and 4 
were exactly the same in both Film A and Film 
B. 

For the subjects in this experiment 147 under- 
graduate students of psychology at Hofstra 
College were used. 


The Questionnaire 


1. Relate the story sequence of the film you 
have just seen: 

2. Place a check opposite the word which 
best describes the emotional behavior of the 
woman in the film: __a. Fear; __b. Sorrow; 
__c. Rage; __d. Joy; __e. Anger; __f. Disgust. 

The “story sequence” part of the question- 
naire was used only as a check to prevent quick 
changes of mind as a result of “seeing through” 
the experiment or seeing someone else’s re- 
sponses. This plan seemed to work as it was 
noticed that some individuals suddenly “caught 
on,” changed their checked responses but did 
not have time to re-write the story sequence 
paragraph. 

Procedure 


The subjects were divided into two groups. 
Group I contained 90 subjects and Group II 
contained the other 57. Group I was shown 
Film A first and then told to answer the ques- 
tionnaire. Then they were shown Film B and 
told to fill out the second questionnaire. The 
same questionnaire was used in both cases. 

Group II was shown Film B first and then 
Film A and the questionnaire was used as be- 
fore. The Phi Coefficient and χ² were used to 
analyze the data. 


Results 


Tables 1 and 2 show the results in terms of 
the percentage distributions of the actual 
response. 

The tables clearly show the trend of the indi- 
viduals tested to perceive the emotion of the 
woman in Scene 4 as being some emotion other 
than fear after seeing Film B. In Group I, 92 
per cent of the subjects checked fear as the 
emotion exhibited by the woman, but this was 
reduced to 57 per cent after seeing Film B. On 
the other hand, no individual identified the 
emotion as joy after seeing Film A, yet 27 per 
cent of Group I said the same scene expressed 
joy after seeing Film B. 

When Film B was shown first (Group II) the 
same general trend could be noted. After 
seeing Film B first, 77 per cent of the subjects 
checked fear as the emotion expressed by the 
woman. This was increased to 94 per cent 
after seeing Film A. The joy responses de- 
creased from 9 per cent after seeing Film B to 
0 per cent after seeing Film A. 

Due to the nature of the data, all analyses 
were made between the two films and the fear 
and “other” responses. 

The control was used to determine if Film A 
and Film B were actually perceived to be differ- 
ent. The χ² computed was 8.28, which is sig- 
nificant at the 1 per cent level. Therefore, it 
could be said that the null hypothesis was re- 
jected and that the two films were different. 

The next comparison was made with the re- 
sponses to Film A when shown first and when 
shown second. The χ² value was 4.32, sig- 
nificant at the 5 per cent level. Again the null 
hypothesis was rejected and it was assumed 
that there was a true difference in response due 
to the showing order of the films. 


Table 1 

Percentage of Checked Responses for Group I 
on Films A and B 
Note: N = 90 

                 Film A      Film B 
Emotion          Per Cent    Per Cent 

a. Fear             92          57 
b. Sorrow                        0 
c. Rage                          2 
d. Joy               0          27 
e. Anger                        10 
f. Disgust                       2 
   Horror*                       2 
   Panic*                        0 

Total              100         100 

* Responses written in on the questionnaire by the 
subjects. 


Table 2 

Percentage of Checked Responses for Group II 
on Films B and A 
Note: N = 57 

                 Film B      Film A 
Emotion          Per Cent    Per Cent 

a. Fear             77          94 
b. Sorrow            0 
c. Rage             12 
d. Joy               9           0 
e. Anger 
f. Disgust 
   No response       0 

Total              100 


In comparisons of responses within Group I, 
the χ² value was 29.34 (significant at the 1 per 
cent level) and it was assumed that there was a 
significant change in response to Film B after 
seeing Film A. In comparisons of responses 
within Group II the χ² value was 8.13 (sig- 
nificant at the 1 per cent level) and it was as- 
sumed that there was a significant change in 
response to Film A after seeing Film B. 
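The kind of fear versus "other" comparison described above can be sketched as a chi-square test on a 2 x 2 table. The example below is a rough illustration under stated assumptions: the counts are rebuilt from the rounded percentages reported for Group I (92 and 57 per cent of 90 subjects), the two showings are treated as independent samples, and no continuity correction is applied. The paper does not say which computing formula was used, so the result need not equal the reported 29.34. The function name is hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


n_group1 = 90

# Counts reconstructed from rounded percentages (an approximation).
fear_after_a = round(0.92 * n_group1)   # fear responses after Film A
fear_after_b = round(0.57 * n_group1)   # fear responses after Film B

chi2 = chi_square_2x2(
    fear_after_a, n_group1 - fear_after_a,
    fear_after_b, n_group1 - fear_after_b,
)

# Critical value of chi-square with 1 degree of freedom at the 1 per cent level.
CRITICAL_01 = 6.635

print(f"chi-square = {chi2:.2f}, significant at .01: {chi2 > CRITICAL_01}")
```

Because the same subjects saw both films, a test for correlated proportions would be the stricter choice; the independent-samples form is used here only because it needs nothing beyond the marginal percentages.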


Conclusions 


The concept formed by the perception of a 
motion picture film is not dependent upon the 
reaction to the individual scenes that have been 
cut and spliced together but rather on the se- 
quence as a whole. One of the identical parts 
or scenes of Film A (Scene 4) was described 
significantly differently than when viewed in 
Film B. This difference was due to the show- 
ing of the same scene in a different context. 

There is a significantly different response to 
the same film when shown before or after the 
second film. 


Measuring Exposure to Advertisements 


William T. Moran 
Commercial Research Dept., Pillsbury Mills, Inc., Minneapolis, Minnesota 


Methods for obtaining an accurate measure 
of the portion of a universe which has been ex- 
posed to an advertisement or magazine have 
been eagerly sought by members and clients of 
the advertising profession. At least two major 
techniques already have emerged. The first 
was an interview technique and statistical cor- 
rection formula advanced by Lucas (2, pp. 621- 
623) to measure advertisement readership. 
The second is the one currently used by the 
Magazine Audience Group in its Continuing 
Study of Magazine Audiences (6). This tech- 
nique seeks to meet the problem of magazine 
readership entirely by interviewing method. 
The “Starch” readership ratings provided by 
Daniel Starch and Staff make no attempt to 
correct for confusion beyond normal inter- 
viewing methods. 

The Lucas technique (4), which this author 
considers to be the most outstanding contribu- 
tion in this area, consists of the presentation of 
an equal number of unpublished advertise- 
ments, along with published advertisements, 
to a sample. The interviewer then asks the 
respondents whether or not they recall having 
seen the advertisements (pretest), and the 
same procedure is repeated after publication 
(post test). The correction formula for the ob- 
tained recognition consists of applying the pro- 
portion which claims recognition on the pretest 
and post test in the following manner: 


\[
\frac{\text{Post test score} - \text{Pretest score}}{100 - \text{Pretest score}}\,(100) = \text{Adjusted Audience}
\]
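The correction formula above can be sketched as a small function. The function name and the sample scores are hypothetical; both scores are assumed to be percentages of the sample claiming recognition.

```python
def adjusted_audience(post_test_score, pretest_score):
    """Lucas-style correction: remove pretest (false) recognition from the post test score.

    Both arguments are percentages (0 to 100). The result is also a percentage.
    """
    if not 0 <= pretest_score < 100:
        raise ValueError("pretest score must be at least 0 and below 100")
    return (post_test_score - pretest_score) / (100 - pretest_score) * 100


# Hypothetical scores: 40 per cent claim recognition after publication,
# 10 per cent claimed recognition before publication.
print(round(adjusted_audience(40, 10), 1))
```

When the pretest score is zero the adjusted audience equals the post test score, as the formula requires.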


The Magazine Audience Group technique ap- 
proaches the problem of confusion by an inter- 
viewing technique which tends to force con- 
fused persons to disclaim recognition. 

Criteria. Lucas (4, p. 136) defines the audi- 
ence which he determines as “only those per- 
sons who saw and got some measurable impres- 
sion of the advertisement when they looked 
into the magazine.” However, he does not 
strictly confine his measurement to his defini- 
tion inasmuch as he includes the same propor- 
tion of his “confused” persons in the adjusted 
audience as he does his accurate persons. 

The Magazine Audience Group technique 
defines the audience similarly. Both tech- 
niques seek to remove all measurable inflation. 
Bigelow (1, p. 337) on the other hand, recog- 
nizes the existence of deflationary confusion 
but states that “this ‘deflationary’ confusion 
does not distort the ratings, since the combina- 
tion of recognition and recollection furnishes 
an adequate criterion of readership.” Un- 
fortunately, there is considerable reason to be- 
lieve that those impressions which remain in 
the subconscious affect behavior as greatly as 
do impressions which can be easily recalled. 

Repeated impressions, all subliminal, may 
result through the process of summation in the 
eventual conscious perception of the stimulus. 
Also, a subliminal stimulus which only pene- 
trates the subconscious may later move from 
there to the conscious surface when the physi- 
ological and psychological condition of the 
recipient is amenable. 

There are other organizations today which 
are espousing the revelatory conclusion that 
there is no confusion among respondents, and 
they, apparently, have taken their lead from 
the Magazine Audience Group method. 
Rather, they insist, there are only confused 
questionnaire items. Such a position is 
steeped in the same psychological naivete as 
are election reports from oppressed countries 
where, according to the ballot, there is abso- 
lutely no defection among the citizens. Un- 
fortunately, our psychological frailties cannot 
be eliminated by a mere refusal to recognize 
them. 

Hypotheses 


1. Chance Distribution of Confused Responses. 
Lucas’ formula omits consideration of the pos- 
sibility of negative responses by confused per- 
sons. There is, therefore, the implicit assump- 
tion that all confused responses are affirma- 
tive. The Magazine Audience Group tech- 
nique makes the same operational assumption 
since it considers the confusion problem re- 
moved when persons who are doubtful are 
forced to disclaim readership. This may be 
why the two techniques achieved such similar 
results in the comparative test conducted by 
the Magazine Audience Group (6, p. 11). 

If specific determiners are removed from the 
situation, it seems logical that confused per- 
sons will respond in a chance distribution. In 
the Lucas technique the respondent can feel 
that the interviewer will be pleased by a “Yes” 
response. This specific determiner biases the 
confused responses in that direction but to an 
unmeasurable extent. Affirmative bias can 
be removed by showing the pretest respond- 
ents pairs of advertisements in which both ad- 
vertisements in a pair are unpublished and 
advertising the same brand product. The 
respondent is informed that, by arrangement 
with the magazine, one of the advertisements in 
each pair was printed in some copies and the 
other advertisement in the other copies of the 
magazine in such a way that it is highly un- 
likely that any person has seen both. The 
interviewer then asks which, if either, adver- 
tisement has been seen. The respondent is 
then placed in the position of choosing between 
two advertisements and cannot, therefore, 
“please” the interviewer simply by saying 
“Yes.” 

In such a pretest situation the best guess of 
the behavior of confused persons would be that 
it approximates behavior of guessers on a true- 
false test. Therefore, the hypothetical number 
of confused responses on the pretest is twice 
the number who state that they have seen the 
experimental advertisement. In this type 
situation “Don’t Know” responses can be 
split evenly into “Yes” and “No” categories. 
Such responses are confused and, therefore, 
following the previous assumption of chance 
distribution, must be said to fall randomly on 
a continuum between “Yes” and “No.” 
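The chance-distribution assumption can be sketched as follows. The function name and the response counts are hypothetical; the only rules taken from the text are that "Don't Know" responses are split evenly and that the confused proportion on the pretest is twice the affirmative proportion. The cap at 1.0 is a guard added here and is not part of the stated hypothesis.

```python
def pretest_confusion(yes, no, dont_know):
    """Estimate confusion on a paired pretest where the test advertisement is unpublished.

    Returns (affirmative, confused, accurate) as proportions of the sample.
    """
    total = yes + no + dont_know
    # "Don't Know" responses are split evenly between "Yes" and "No".
    affirmative = (yes + dont_know / 2) / total
    # Confused persons are assumed to answer at chance, so for every
    # confused "Yes" there is one confused "No".
    confused = min(1.0, 2 * affirmative)
    accurate = 1 - confused
    return affirmative, confused, accurate


# Hypothetical pretest of 100 respondents.
affirmative, confused, accurate = pretest_confusion(yes=12, no=80, dont_know=8)
print(affirmative, confused, accurate)
```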

2. Biased Distribution of Confused Responses. 
The thinking behavior involved in recalling 
something which has occurred (positive experi- 
ence) differs markedly from that of determining 
that something has not occurred (negative ex- 
perience). For that reason there is greater 
confusion on the part of persons who have not 
been exposed to a stimulus than on the part of 
those who have. Furthermore, a dramatically 


unique stimulus will form more and stronger 
associations than lesser stimuli. Through 
empirical knowledge of this function persons 
can ascertain more quickly and accurately that 
they have not been hit by an automobile than 
that they have not eaten apple pie in the past 
month. 

An advertisement with a large degree of 
dramatic uniqueness will elicit fewer confused 
responses on the pretest and, through the re- 
gression-like process of positive experience, 
will result in a larger proportion of affirmative 
answers on the post test among those who were 
exposed to the advertisement but would have 
been confused on the pretest. The proportion 
of confused responses, therefore, decreases 
from pretest to post test in some relation to the 
dramatic uniqueness of the advertisement as 
measured by the proportion of confused per- 
sons on the pretest. Although this function 
is probably curvilinear, the best assumption 
that can presently be made is that the rela- 
tionship is rectilinear. 

The proportion of confused responses on the 
pretest can vary from 100% to 0% (increasing 
dramatic uniqueness). Among those who were 
exposed to the advertisement and who would 
have given (did give) confused responses on the 
pretest, the proportion of affirmative answers 
on the post test can vary from 50% (random) 
to 100%. 


\[
\%\ \text{Affirmative responses on post test of confused “persons” who were exposed}
  = 50\% + \frac{\%\ \text{Accurate (not confused) on pretest}}{2}
\]
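The rectilinear assumption above amounts to a straight line from 50 per cent (no one accurate on the pretest) to 100 per cent (everyone accurate). A minimal sketch, with a hypothetical function name and percentages as inputs:

```python
def affirmative_rate_of_exposed_confused(pct_accurate_on_pretest):
    """Per cent of exposed, previously confused persons answering "Yes" on the post test.

    Linear from 50 (pure chance) to 100 as pretest accuracy rises from 0 to 100.
    """
    if not 0 <= pct_accurate_on_pretest <= 100:
        raise ValueError("percentage must lie between 0 and 100")
    return 50.0 + pct_accurate_on_pretest / 2.0


# End points of the assumed line, and one intermediate (hypothetical) value.
for pct in (0, 68, 100):
    print(pct, affirmative_rate_of_exposed_confused(pct))
```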


Since there is no medium for the mass com- 
munication of knowledge (stimuli) apparent to- 
day whereby each recipient receives a stimulus 
exclusive of all other persons, the more wide- 
spread the perception of (exposure to) a stimu- 
lus the greater the duplication of exposures. 
Therefore, the greater the number of persons 
who have been exposed to an advertising stimu- 
lus the greater the number of multiple exposures 
among those who have been exposed at all, 
and by summation, the greater the conscious 
awareness of those who were prone to be con- 
fused. 

The cumulative effect of positive experience 
is such that while the proportion of the popu- 
lation which has been exposed can vary from 
0% to 100%, the proportion of affirmative re- 
sponses can vary from 50% (random) to 100% 
among the exposed persons who would have 
given confused responses in the negative experi- 
ence situation. This effect, also, is probably 
curvilinear, but without further evidence we 
must say it is rectilinear. 


\[
\%\ \text{Affirmative responses on post test of confused “persons” who were exposed}
  = 50\% + \frac{\text{Proportion of sample population exposed to the advertisement}}{2}
\]





The author expects that there will be con- 
siderable dissatisfaction with the foregoing as- 
sumptions. Any such dissatisfaction, how- 
ever, should prove more stimulating to further 
research. These hypotheses are consciously 
made and quite subject to verification or 
modification by further research with the ex- 
perimental techniques now available. There 
are many other variables affecting response 
behavior, and they can be included as they and 
their behavior are determined. 

3. Analysis of Confusion Factors. Thus, on 
the post test the affirmative response total is 
composed of (a) the number accurate (not 
confused) which was exposed to the advertise- 
ment, (b) one-half of the number confused 
which is unexposed, and (c) some proportion of 
the number confused on the pretest which was 
exposed. This number is determinable from 
the foregoing equations for biased distribution 
of confused responses. The proportion of the 
total sample answering affirmatively, t, equals 
a+b+c. In deriving the correction formula 
which follows, the following symbols will be 
used: 


a = the accurate proportion responding 
affirmatively on the post test; 

b = the confused proportion on the pretest 
who are unexposed ÷ 2; 

c = the confused proportion on the pretest 
who are exposed and answer affirma- 
tively on the post test; 

d = the accurate proportion (not confused) 
on the pretest; 

t = the proportion of the total sample re- 
sponding affirmatively on the post test 
(the proportion responding “Yes” plus 
one-half the proportion responding 
“Don’t know”); and 

x = the proportion of the sample which was 
exposed to the advertisement. 


Therefore, the following relationships may 
be said to exist within the hypothetical limits 
stated in the foregoing portions. 


\[
a = xd, \qquad b = \frac{(1 - x)(1 - d)}{2}
\]

The above equation, then, gives the proportion 
of a sample which has been exposed to an ad- 
vertisement using only the proportion of the 
sample which was not confused on the pretest 
and the proportion of the sample which re- 
sponded affirmatively on the post test. 
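The bookkeeping behind t = a + b + c can be illustrated with a short sketch. It rests on assumptions that should be read as such: a and b are computed from the verbal definitions (accurate and exposed; one-half of the confused and unexposed), exposure and accuracy are treated as independent, and c is obtained only as the remainder t − a − b rather than from a closed form. In practice x is the unknown the correction formula solves for; it is supplied here only to show how the three components add up. All names and values are hypothetical.

```python
def response_components(x, d, t):
    """Split the post test affirmative proportion t into components a, b and c.

    x: proportion of the sample exposed to the advertisement
    d: proportion accurate (not confused) on the pretest
    t: proportion answering affirmatively on the post test
    """
    a = x * d                      # accurate persons who were exposed
    b = (1 - x) * (1 - d) / 2      # half of the confused, unexposed persons
    c = t - a - b                  # remainder: confused, exposed, answering "Yes"
    return a, b, c


# Hypothetical values for illustration only.
a, b, c = response_components(x=0.40, d=0.68, t=0.40)
print(round(a, 3), round(b, 3), round(c, 3))
```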











t=xd+(1-— 








Procedure 

The next step was to put the foregoing hy- 
potheses into practice with a pilot study. An 
experimental design was constructed to com- 
pare the Lucas method with the author’s 
method. 


1. The Displays. The display to test the 
Lucas method (hereinafter referred to as 
Method I) consisted of seven advertisements 
including unpublished advertisement X. The 
other display to test the author’s method 
(Method II) consisted of four pairs of adver- 
tisements—one pair including advertisement 
X. Unfortunately, except in the case of ad- 
vertisement X, unpublished advertisements 
were unobtainable, and it was necessary to use 
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advertisements one and a half years old in 
their stead. This compromise was felt to be 
sufficient for a preliminary study. 

The selection of control advertisements for 
each pair determines what it is that is being 
measured. Where the situation is amenable, 
respondents generalize their purported recall 
to include the largest generic category possible. 
In fact all affirmative responses actually mean 
“Yes, I’ve seen an advertisement like that.” 
The breadth of meaning covered by the word 
“like” is controlled by the similarity or dis-
similarity of the control advertisement. It
should, then, be possible to measure exposure 
to various stimuli from a campaign series of 
advertisements through an individual adver- 
tisement to a specific portion of an advertise- 
ment by proper selection of control advertise- 
ments. An advertisement is no single stimulus 
but, rather, a matrix of stimuli more or less 
related and coherent depending upon the artist 
and the copy writer. 

2. Interview Methods. Interview Method I 
was conducted by asking the respondent which 
ones, if any, of the advertisements in the dis- 
play were seen “in the past month or so.” 
Then the interviewer flipped over each display 
advertisement one at a time so that the re- 
spondent determined her response to each ad- 
vertisement before knowing what advertise- 
ment would come next. 

Interview Method II required a more elabo-
rate preamble in which the respondent was told
that the magazines in which each pair of ad-
vertisements was published alternated adver-
tisements in such a way that “it is almost im-
possible for anyone to have seen both versions
of the advertisement, and, of course, some
people may not have seen either one.”

3. The Samples. For the pretest the inter- 
viewers were given a quota sample of house- 
wives to obtain in Minneapolis, Minnesota, 
and were told to alternate Methods I and II 
from interview to interview so that every other 
interview was by Method I and the alternate 
interviews by Method II. In this manner 
comparable samples were obtained for each 
method. The pretest sample for Method I is
referred to, hereafter, as sample Ia1, and the
pretest sample for Method II is referred to
as sample IIa1.


On the post test the interviewers were told
to interview the same persons as on the pre-
test, using the same method in each case.
This post test sample for Method I is referred
to as sample Ia2, and this post test sample for
Method II is referred to as sample IIa2.

In addition, on the post test, the interviewers
were told to obtain for each interview in sample
Ia2 an additional interview one block further
on the route and on the opposite side of the
street, or as near thereto as possible. The
same instructions were issued for each inter-
view in sample IIa2. This post test sample for
Method I is referred to as sample Ib2, and this
post test sample for Method II is referred to
as sample IIb2.

4. The Experimental Design. The experi-
mental design was very simple. Pretests were
conducted on samples Ia1 and IIa1 prior to
publication of advertisement X. Four to six
weeks after publication of advertisement X in
two national monthly magazines and one na-
tional weekly magazine, post tests were con-
ducted on samples Ia2, Ib2, IIa2, and IIb2. The only
change in the interview situation was to ask
respondents whether or not they had seen any
of the advertisements “in the past two months
or so” rather than “in the past month or so”
as on the pretest.


Results 


As can be seen, considerably different results 
are obtained by the two methods. Part of 
the difference is due to the different definitions 
as to what should be measured and part is due 
to different psychological theories. 

The large difference between the two ad- 
justed audience scores obtained by the Lucas 
method (15%, 24%) arises from a difference of 
3% in per cent affirmative responses in samples 
Ia2 and Ib2. This 3% difference is not signifi-
cant in samples of that size; therefore there is
no significant difference between the 15% and 
24% adjusted audiences. A mathematical 
feature of the Lucas formula is the greater in- 
flation of small differences as the per cent of 
confused responses increases. This feature is 
unfortunate inasmuch as the correction for 
confusion is the prime intent of the formula. 
When there is much obtained confusion, ex-
tremely large samples would be required to
reduce sampling error to the point that Lucas’
adjusted audience would possess reasonably
small error limits.


Table 1
Results of Comparative Survey: Method I vs. Method II

                                              Method I    Method II

Sample sizes
  Pretest (Ia1, IIa1)                            48          50
  Post test (Ia2, IIa2)                          36          32
  Post test (Ib2, IIb2)                          49          49
Results
  Same samples on pretest, post test
    % confused responses on pretest              67%         44%
    % affirmative responses on post test         75%         53%
    adjusted audience (I); % exposed (II)        24%         46%
  Different samples on pretest, post test
    % confused responses on pretest              67%         44%
    % affirmative responses on post test         72%         52%
    adjusted audience (I); % exposed (II)        15%         45%

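
The statement that a 3% difference is not significant in samples of that size can be checked with a conventional pooled two-proportion z test. What follows is a minimal sketch, not the author's own computation. It assumes the Method I post-test figures of Table 1 (75% affirmative with n = 36 for the same-sample group, 72% with n = 49 for the different-sample group) and relies on the normal approximation, which is only rough for samples this small. The function name is hypothetical.

```python
import math


def two_proportion_z(p1, n1, p2, n2):
    """Pooled two-proportion z statistic and two-sided p-value (normal approximation)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # erfc(|z| / sqrt(2)) equals the two-sided tail area of the standard normal
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Assumed inputs, read from Table 1 (Method I, post test)
z, p = two_proportion_z(0.75, 36, 0.72, 49)
print(f"z = {z:.2f}, two-sided p = {p:.2f}")
```

Under these assumptions z falls well below the conventional 1.96 cutoff, which agrees with the text's reading of the 3% difference.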

Method II deflated the affirmative responses 
in this case by 7%. The amount of deflation 
varies with the amount of confusion and with 
the affirmative responses in the post test. The 
deflation, under various circumstances, can 
range from a negligible amount to well over 
15% by using Method II and its corresponding 
formula. In fact, in some instances such as 
where the amount of confusion is great and the 
affirmative response on the post test is large, 
the total effect of confusion is deflationary, and 
the formula actually gives a per cent exposed
which is larger than the affirmative response on 
the post test. That result is quite logical when 
it is considered that if everyone in a sample 
were exposed, any confused responses in the 
negative would result in an affirmative re- 
sponse smaller than the actual exposure. In 
such an instance the Lucas formula would give 
an adjusted audience smaller than the affirma- 
tive response. 

The similar proportions of affirmative re- 
sponses with the same and different samples on 
the post tests may have been due to attempts 
on the part of respondents to recall having seen 
the advertisement outside the test situation. 
This effect was apparent for both methods. 
There was a 3% difference between samples for 
Method I and a 1% difference for Method II. 

Furthermore, in the case of Method II re-
spondents in each sample saw both the control
and the experimental advertisement and were
asked which one they recalled seeing. This de- 
sign would tend to force respondents who have 
seen the advertisements in the pretest to rely 
on the stronger impression (i.e., the one that 
was reinforced by seeing the advertisement in a 
magazine). The fact that the recognition of 
the published advertisement did increase after 
publication would seem to bear this out. 

In order to forestall any objections on the 
basis of sample representativeness it must be 
stated that the scores obtained for advertise- 
ment X are not held to be representative of 
any universe other than that of the samples 
themselves. The samples resemble each other 
very closely; so they are quite adequate for a 
comparison of Methods I and II. 


Summary 


Current methods of dealing with confusion 
either do not adequately determine the dis- 
tribution of confused responses (2, 3, 4) or else 
seek to avoid the problem by forcing confused 
respondents to respond in the negative, i.e., 
that they have not seen the advertisement in 
question (6). 

Although the method proposed in this paper 
has a pretest and a post test as does the Lucas 
method, it employs displays consisting of pairs 
of advertisements rather than series of single 
advertisements. It is then proposed that the 
confused responses on the pretest (prior to
publication of the experimental advertisement)
distribute themselves in a chance order.

The subsequent hypotheses of this proposed 
method have to do with the diminution of the 
proportion of confused responses on the post 
test due to the effects of exposure to the ad- 
vertisement. Two factors, “dramatic unique- 
ness” and “positive experience” are consid- 
ered, and approximations of their hypothetical 
effects on confusion are algebraically described. 

A pilot study was then conducted to com- 
pare the Lucas method with the author’s 
method in terms of actual results. The widely 
different scores obtained by the two methods 
were partly due to differences in criteria and 
partly due to differences in psychological 
theory. It is hoped that additional research 
will result from this disparity and lead to a
better understanding of the dynamics of
confusion.


Received February 23, 1950.


Moran’s “Measuring Exposure to Advertisements”


Norman Heller 
New York University 


In an article in this journal, Moran¹ proposes
a new method of determining exposure to ad- 
vertisements. In general this study is impor- 
tant as it should provoke much thought, 
skepticism, and further research in a neglected 
area. However, there are assumptions made 
by Moran which this writer considers unten- 
able. It should prove worth while to go over 
these singly. 

Concerning the distribution of confused re-
sponses, Moran assumes that they resemble the
behavior of guessers on a true-false test, just as
many guessing true as guessing false. “There-
fore, the hypothetical number of confused re-
sponses on the pretest is twice the number who
state that they have seen the experimental ad-
vertisement.” Concerning the behavior of
guessers on a true-false test, however, this
writer is of the opinion that more guessers will
be correct than wrong because of past experi-
ence, subliminal cues, etc. Therefore, the
number of people who choose one alternative
will not equal the number who choose the
other alternative. Also, if over fifty per cent
of the people who say “yes” are confused, and
according to the above statement an equal
number who say “no” are confused, where, I
ask, are they to come from?


¹ W. T. Moran. Measuring exposure to advertise-
ments. J. appl. Psychol., 1951, 72-77.

Another assumption calls for an equal dis- 
tribution of the “don’t know” responses be- 
tween “yes” and “no,” following his previous 
assumption of chance distribution. There are 
four things that are usually done with or 
about “don’t know” responses:


A. Divide them equally between the “yes”
and “no” answers.

B. Divide them proportionately between
the “yes” and “no” answers.

C. Simply report them as such. 

D. Force the respondent to answer “yes” 
or “no.” 
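
Options A and B can be made concrete with a small calculation. The sketch below uses hypothetical survey shares (30% “yes,” 50% “no,” 20% “don’t know”) that do not come from any study cited here; the function names are likewise illustrative only.

```python
def allocate_equal(yes, no, dont_know):
    """Option A: split the "don't know" share evenly between "yes" and "no"."""
    half = dont_know / 2
    return yes + half, no + half


def allocate_proportional(yes, no, dont_know):
    """Option B: split the "don't know" share in the ratio of decided answers."""
    decided = yes + no
    return (yes + dont_know * yes / decided,
            no + dont_know * no / decided)


# Hypothetical shares, chosen only for illustration
yes, no, dont_know = 0.30, 0.50, 0.20
print("A:", allocate_equal(yes, no, dont_know))
print("B:", allocate_proportional(yes, no, dont_know))
```

With these made-up shares, option A credits the “yes” side with 40% and option B with 37.5%, so the choice of rule alone moves the estimate.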


Like Moran, various pollsters in past elec-
tions distributed the “don’t know” answers
via A or B. Both procedures proved to be 
empirically wrong. In the light of such evi- 
dence the writer favors method D as the safest 
course. For example, if a large proportion of 
the people who give “don’t know” answers 
were exposed to the advertisements, then divid- 
ing these responses equally between yes and 
no would yield biased results.

An interesting point made by Moran is that:
“Since there is no medium for the mass com- 
munication of knowledge (stimuli) apparent 
today whereby each recipient receives a stimu- 
lus exclusive of all other persons, the more wide- 
spread the perception of (exposure to) a stimu- 
lus the greater the duplication of exposures. 
Therefore, the greater the number of persons 
who have been exposed to an advertising stimu- 
lus the greater number of multiple exposures 
among those who have been exposed at all, and 
by summation, the greater the conscious aware- 
ness of those who were prone to be confused.” 

Now it is true that the more widespread the 
perception of a stimulus (magazine), the greater 
the circulation of that magazine and therefore 
the greater the probability of one happening 
upon that magazine at a different time in a dif- 
ferent place. But we are concerned with the 
measurement of exposure (or confusion) to a 
specific advertisement within that medium. 
This writer feels that the widespread perception 
of an advertisement is not related to duplication 
of perception except in the case of outdoor ad- 
vertising media. More important is the very 
physical nature of the magazine itself. That 
is, there is a tendency for duplication of ex- 
posure to be related to size (amount of reading 
content) and a host of other factors such as 
interest and prestige. Even this remains to 
be verified, for this assumes that the more 
content in a magazine, the less the probability 
of a reader finishing and discarding it in one 
sitting, and therefore the greater the probabil- 
ity of the duplication of exposure. 

Another objection is that Moran’s technique
does not account for deliberate liars, who are 
an important factor to eliminate in measuring 
the effect of advertisements. Do they also 
distribute themselves by chance, just as many 
falsely stating “yes” as falsely stating “no”?
I doubt it. 


Moran does not appear to realize that the 
setup of his displays biases the results. He 
uses one single test advertisement (called X) 
and six previously published advertisements, 
some one and a half years old. The context or 
environment produced by these published ad- 
vertisements is suggestive and likely to inflate 
or bias the amount of confusion of X. A fair 
test of the Lucas² method is to use the Lucas
method; i.e., an equal number of published and
unpublished advertisements. 

This brings me to my final and chief criticism 
of Moran’s method. It is his primary purpose 
to estimate the proportion of a sample which 
has been exposed to an advertisement. He 
accomplishes this through the use of an alge- 
braic equation which requires only a knowledge 
of the proportion of the sample which was not 
confused on the pretest and the proportion of 
the sample which responded affirmatively on 
the post test. His field testing consists of dis- 
plays employing pairs of advertisements rather 
than series of single advertisements. The 
amount of “measured” confusion can there-
fore be changed by changing the degree of
similarity between the advertisements within
each pair. If the amount of confusion is
changed, then so is the number of “meas-
ured” exposures. But since we are trying to
measure the number of exposures (which is con- 
stant at the specific instant we attempt to 
measure it) then it should not be liable to such 
changes. That it is, is an inadequacy of the 
measuring instrument.


Received June 19, 1950 and 
published out-of-turn by editor.


A Reply to Heller’s Note


William T. Moran 
Commercial Research Dept., Pillsbury Mills, Inc., Minneapolis, Minnesota 


Since Heller’s critique¹ is mainly concerned
with the presentation of counter-assumptions
to those in this writer’s article, it is difficult to
reply with concrete refutation devoid of polem-
ics. However, the following responses may
clarify this writer’s position in regard to the
assumptions.


¹ N. Heller. Moran’s “Measuring exposure to adver-
tisements.” J. appl. Psychol., 1951, 77.

If, as Heller states, it is true that more
guessers are correct than incorrect, Lucas’
adjustment is in even greater error, as is the 
Magazine Audience Group, since both methods 
make the implicit assumption that all con- 
fused responses are affirmative. In regard to 
the corollary embarrassment of finding over 
50 per cent of the sample responding affirma-
tively on the pretest, the mathematical possibil-
ity certainly exists. This writer prefers to
stand on the assumption, however, and sug- 
gests that any such empirical findings may well 
be founded upon the introduction of specific 
determiners to the experimental milieu. 

The plea for forcing the respondent to answer 
yes or no is irrefutable in the light of the 1948 
election. If the researcher succeeds, that 
portion of the algebraic manipulation may 
easily be ignored. Should a perverse inter- 
viewer present him with a don’t know ballot 
or two, however, there is an opportunity to 
avoid hiding the questionnaires. For that 
matter, in defense of don’t know responses, it 
appears that the problem of predicting one’s 
own future behavior is not psychologically 
equivalent to that of remembering past be- 
havior. 

The assumption of increasing duplication of
exposure with increasing exposure to an ad-
vertisement must, apparently, remain an as-
sumption for the present. As for the failure
to account, separately, for deliberate liars, this
writer must confess to an incomplete discussion
of advertising measurement. It would be
intriguing to learn which of two paired ad-
vertisements would elicit a greater desire to lie.

The set-up of displays in the preliminary 
study most certainly introduces potential bias. 
It is not, however, discriminatory with refer- 
ence to Lucas’ method since this writer’s 
method is similarly distorted by the same cir- 
cumstance. 

Contrary to Heller’s chief criticism, a change
in measured confusion on the pretest, induced 
by varying the similarity between paired ad- 
vertisements, does not change the number of 
measured exposures. True, the amount of 
measured confusion would differ but, similarly, 
the obtained amount of affirmative responses 
would differ on the post tests. So, the cor- 
rected values of per cent exposed would agree. 

Assuming on two pretests of the same 
experimental advertisement that confusion 
scores of (a) 40% and (b) 60% were obtained 
and assuming, further, that true exposure upon 
publication is known to be 30%, then by the 
writer’s assumptions post test affirmative re- 
sponses of (a) 40.7% and (b) 45.15% would 
be obtained. Working conversely, from the 
formula, the values of per cent exposed would 
agree. 
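
The arithmetic of this example can be sketched in a few lines, with one loud caveat. The author's expression for c (confused, exposed, and answering affirmatively) is not legible in the text as it survives, so the form used below is an assumption. It was chosen because, together with a = xd and b = (1 - x)(1 - d)/2, it reproduces the 40.7% and 45.15% quoted above and is consistent with the Method II rows of Table 1; it should not be read as the published formula. Here d is taken as one minus the confusion score, and the function names are hypothetical.

```python
def affirmative_share(x, d):
    """Post-test affirmative proportion t for exposure x and pretest accuracy d."""
    a = x * d                            # accurate and exposed
    b = (1 - x) * (1 - d) / 2            # confused and unexposed, split by chance
    c = x * (1 - d) * (2 + x + d) / 4    # ASSUMED form for confused and exposed
    return a + b + c


def exposed_share(t, d, tol=1e-10):
    """Solve affirmative_share(x, d) = t for x by bisection on [0, 1]."""
    lo, hi = 0.0, 1.0
    if not affirmative_share(lo, d) <= t <= affirmative_share(hi, d):
        raise ValueError("t is outside the range the assumed model can produce")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # affirmative_share rises with x under the assumed form, so bisection applies
        if affirmative_share(mid, d) < t:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Two pretests with different confusion scores, same true exposure of 30%
for confusion in (0.40, 0.60):
    d = 1 - confusion
    t = affirmative_share(0.30, d)
    x = exposed_share(t, d)
    print(f"confusion {confusion:.0%}: t = {t:.4f}, recovered x = {x:.4f}")
```

Under the assumed form, both pretests lead back to the same exposure figure, which is the point of the reply: a change in measured confusion shifts the post-test affirmative responses, and the correction undoes the shift.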


Received October 26, 1950 and 
published out-of-turn by editor. 








Book Reviews 


Kornhauser, A. (Ed.) Psychology of labor-man-
agement relations. Champaign, Ill.: Indus-
trial Relations Research Association, 1949.
Pp. vi + 122. $1.50.

This volume represents Publication No. 3 of
the Industrial Relations Research Association.
It contains seven original papers which were
presented at the Denver meeting of the A.P.A.
in September, 1949, in a program sponsored
jointly by the Industrial Relations Research
Association, the Division of Industrial and
Business Psychology of the A.P.A., and the
Society for the Psychological Study of Social
Issues. Part I, containing three papers, deals
with the rôle of personnel psychology in im-
proving labor-management relations. Viteles
discusses psychological contributions in the
field of employee selection and placement.
Tiffin considers job evaluation, with special
attention to the rôle of a continuous manage-
ment-union committee. Maier discusses super-
visory training, drawing upon his own indus-
trial research on techniques designed to foster
democratic rather than autocratic leadership.
Experiments involving the use of rôle-playing,
risk-taking, group decisions, and other “per-
missive” techniques are reported, and stress is
laid upon the fact that problems in human re-
lations training are primarily problems involv-
ing attitude and personality change. The
three papers are followed by brief discussions
by Bellows, H. C. Taylor, Gomberg, and others.

Part II begins with Katz’s paper on the 
attitude survey approach to the investigation 
of labor-management relationships. Katz 
calls attention to the need for broader surveys 
embracing the study of attitudes toward 
company and union structures, toward differ- 
ent levels within the management and union 
hierarchies, and toward past and present labor- 
management negotiations and _ conflicts. 
French and Zander present the “group dy- 
namics” approach, with illustrative experi- 
ments on group participation and group deci- 
sion conducted with industrial populations. 
The approach through clinical psychology is
surveyed by McMurray, who discusses the con-
sultant’s rôle in personality assessment and
touches upon some of the conflict-breeding
personality make-ups often found among
management and union leaders. The papers
in Part II are discussed by Stagner, Worthy,
Kerr, and others. The volume concludes with
Part III, consisting of a paper by McGregor in
which he proposes that a theory of organized
human effort might well be built around the
fact that “all human behavior is directed to-
ward the satisfaction of needs,” or “tension-
states” which are manifested in behavior.

The general topics are well chosen, although 
it is undoubtedly true, as Kornhauser points 
out in his Introduction to the volume, that the 
specific topics selected in the personnel field 
are only illustrative and could be extended to 
include the effects of different aspects of work- 
ing conditions, job training, performance re- 
view, safety, fatigue, communications, and 
many other components of a total personnel 
or industrial relations program. The organi- 
zation of the present volume, moreover, is 
rather puzzling, since there appears to be no 
real reason for the distinction between the
three Parts. 

Although the reviewer would agree with 
Katz that many industrial “morale surveys” 
do not go far enough for research purposes, it 
should be pointed out that such studies are 
often conducted to get concrete answers to 
specific practical questions, rather than to do 
basic research. Similarly, we can agree with 
French and Zander that “nothing is so prac- 
tical as a good theory” (p. 71), although one 
can question their choice of a theoretical 
structure with such vague and often circular 
definitions of variables as those in “field the- 
ory.” Certain experiments reported by these 
authors are also open to criticism from the 
standpoint of methodology. In one such ex- 
periment on the relation between morale and 
the degree of participation in decision-making, 
it is stated that the three experimental groups 
“were matched for the difficulty of the job, 
for the amount of change to be made in the 
job, and for the level of productivity before the 
experiment” (pp. 73-74). The question im-
mediately arises regarding the need for match-
ing these groups in other important variables,
such as previous work experience, attitudes, 
abilities, motivation, and personality character- 
istics. Certainly, such factors would appear 
to be extremely important in influencing the 
effects of the experimentally-imposed tech- 
niques. 

Several of the participants in this Sympo- 
sium, such as Kornhauser, Stagner, and Shel- 
low, are critical of psychologists for identifying 
more closely with management than with 
labor. It is undoubtedly true, as Kornhauser 
maintains, that “by and large industrial psy- 
chologists have worked for management and 
have accepted management’s point of view” 
(p. 3). On the other hand, we must strongly 
disagree with Shellow regarding the reason for 
this situation. Her claim that “the unions 
have not been interested in psychology because 
psychologists have not been interested in the 
unions” (p. 60) hardly tells the whole story. 

Gomberg, who is certainly well qualified to 
speak for unions, makes a number of comments 
which throw considerable light upon the union 
attitude toward selection programs. He 
frankly states that the union is not interested 
in techniques for improving selection and in- 
creasing over-all productivity because of 
labor’s fear of resulting reduction in number of 
jobs and a depression of prevailing wage rates. 
Gomberg explicitly recognizes that this is a 
short-range attitude. Interestingly enough, 
however, he argues that “people live in the 
short run, marry in the short run, bring up 
their children in the short run, and develop 
neuroses and psychoses under the stress of in- 
security, all in the short run” (p. 52). The 
implication appears to be that the psychologist,
too, should be short-sighted along with the
worker. As a further rationalization of the 
union attitude toward selection techniques, 
Gomberg criticizes the use of tests with validity 
coefficients no greater than, for example, .40. 
He refers to the “injustices” to workers who 
are unfortunate enough to be incorrectly classi- 
fied as “incompetent” by such an instrument. 
He completely overlooks, however, the more 
numerous “unfortunates” who are selected by
techniques with a validity of .00. Certainly, 
failure to achieve perfect correlation does not 
justify abandonment of the best available pre- 
dictors, whatever their limitations. 


Several of the participants stress the im-
portant rôle of motives and attitudes as factors
influencing worker productivity. Most agree
that there is a need for more basic studies in
this area, studies in which management and
unions should coöperate. Viteles emphasizes
a very practical point, frequently overlooked
in current studies, when he states that “good
labor-management relations cannot be main-
tained in the absence of productive efficiency
on the part of employees” (p. 9). Similarly,
Gomberg’s characterization of certain “super-
visory techniques” as essentially “manipula-
tive” and his reference to “play-acting democ-
racy” (p. 56) sound a healthy note of caution
against the danger of superficiality in such an
approach. Finally, Kornhauser stresses the
desirability of differentiating between the psy-
chologist’s rôle as a technologist bent upon
solving a practical problem and his rôle as a
scientist seeking basic knowledge. Bellows
and H. C. Taylor are also careful to distinguish
between psychology as technology and psy-
chology as science.

Perhaps the chief significance of this volume 
lies in its reflection of an increased interest in
viewing human relations problems from the
differing perspectives of the social sciences.
The symposium thus presents a coöperative
and multi-dimensional approach to the human 
factor in industry. 


John P. Foley, Jr. 


The Psychological Corporation,
New York, N.Y. 


Dorcus, Roy M., and Jones, Margaret Hub- 
bard. Handbook of employee selection. 
New York: McGraw-Hill Book Co., Inc., 
1950. Pp. xv + 349. $4.50. 


This book is intended as a reference book 
for persons interested in the use of tests for 
selecting employees. It is a compendium of 
abstracts of the essential data of various 
studies. It is non-critical in the abstract of 
each study, leaving the evaluation of the use- 
fulness of each investigation to the reader. 
In this reviewer’s opinion, the book satisfies a 
need and fulfills the authors’ intentions in an 
admirable manner. 

The abstracts are arranged chronologically 
and numbered in order. Almost all of them
are based on primary sources. A few are
based on personal communication and on 
reputable secondary sources, and these are 
indicated as such. There is a job title index, 
an author index, and a test index. Each index 
refers to the abstracts by number so that the 
abstracts may be located quickly and easily. 

Each abstract is brief, being about half a 
page in length. The abstracts are outlined in 
terms of five categories, namely: (1) the type 
of job and number of subjects, (2) the test 
titles (or a brief description where non-stand- 
ardized tests are used), (3) the type of cri- 
terion of job proficiency, (4) the validity 
statistics, and (5) the reliability of the cri- 
terion, where this is reported. 

Although the abstracts are non-critical, 
rigid requirements had to be met for inclusion 
in this volume. That is, the original sources 
had to contain the information called for in the 
five categories just mentioned (although the re- 
liability of the criterion was not insisted upon 
since this would have excluded too many more 
abstracts). Tests for selection of military 
personnel, aviators, automobile drivers (except 
bus and taxi drivers), accident repeaters, and 
students have been omitted. Studies of 
“average” workers, not comparing good and 
poor workers within the occupation, were also 
omitted. Thus, although 2,100 references 
were examined, only 427 references met the 
criterion of minimum adequacy and are in- 
cluded here. 

The book may, of course, be criticized for
having omitted many references. On the 
other hand, the authors may be commended 
for having done an excellent job of what they 
started to do. Practically all of the useful 
references are included. And the elimination 
of scholastic and military studies does make the 
book more useful for those who are interested 
primarily in industrial testing. As the authors 
state, these scholastic and military references 
are not directly pertinent to the selection of 
civilian employees, and they are covered in 
other sources. The elimination of about 80 
per cent of the references that the authors in- 
vestigated is a reflection upon the inadequacy 
of original papers rather than upon the authors’ 
selection of papers. 

Another criticism is that there is not enough 
detail, particularly regarding the criteria used.
For example, where output data are used as
the criteria, the reader may want to know just
what kind of data. Was an incentive system
in operation on the jobs concerned? Were the
operators producing at a high or low rate
for their industry? Under what conditions
were ratings obtained? Can we be sure the 
ratings are valid? But once again, these criti- 
cisms are not very practical ones. If the 
answers to all questions of this kind were de- 
manded, the book could probably not be writ- 
ten. Or at least it would be only a pamphlet, 
and not a usable book. 

The third possible criticism is again a weak- 
ness of published results, and not of this book. 
That is, an inexperienced reader might con- 
clude that tests always “work.” There are 
few studies that report low or useless validities. 
This is again more truly a criticism of the 
original papers and not of this book. 

In conclusion then, this book is an excellent 
summary of the available sources. It is an 
extremely handy reference book that will pro- 
vide a convenient first step for anyone inter- 
ested in seeing what has been done in using 
tests for selected industrial jobs. It is heartily 
recommended as such. 

Harold F. Rothe 


Stevenson, Jordan & Harrison, Inc., 
Chicago, Illinois 


Parten, Mildred. Surveys, polls, and samples. 
New York: Harper and Bros., 1950. Pp. 
624. $5.00. 


The development of the sample survey 
method over the past twenty years or so has 
progressed without the expected accompani- 
ment of adequate textbook or monograph 
coverage of the field and its problems. Text- 
books which are available have tended, gener- 
ally, either to review specific research, to be 
manuals for one aspect of survey work, or to 
be composite texts in the fields of public 
opinion and communications. Parten has 
written the present volume to fill a “‘need for a 
new book which would bring together in con- 
venient form the current procedures used by 
population surveyors in such fields as market- 
ing, political opinion polling, government 
census, radio audience measurement, socio- 
economic assays, as well as in the more academic
attempts of the social scientists to evaluate
populations by questionnaires and related
devices” (p. ix).

It attempts, as a textbook, to present “the 
historical background of population surveying, 
polling, and sampling in the several active
fields, and . . . to describe the significant current
practices” (p. x). As a manual, the book
emphasizes “specific procedures which provide
the practical answers to the numerous technical
problems arising at every stage of the
survey operation” (p. x). Placing as it does 
primary emphasis on the survey method as a 
research tool, the book pays little attention to 
the large public opinion agencies, the author 
preferring to let the reader evaluate their work 
“by scrutiny of the procedures employed in the 
light of the facts of sound practice” (p. xi). 

That the writer has taken her job seriously is 
indicated by the scope of the finished product. 
Chapters cover social surveys and polls in the 
United States, planning the survey procedure,
methods of securing information, the role of 
sampling, organization and personnel of the 
survey, construction of the questionnaire, 
types of sampling, procedures for drawing 
samples, size of sample, interview procedures, 
mail questionnaire procedures, sources of bias, 
editing the data, coding the data, tabulating 
the data, evaluating the data and sample, and 
preparing and publishing the report. Step by 
step, Parten leads the reader through the vari- 
ety of problems which arise when a survey is 
attempted. Both the inexperienced student 
and the more experienced social scientist will 
find useful the sections on: Organization and 
Personnel of the Survey, which combines a 
description of procedures and a summary of 
survey experiences; Procedures for Drawing 
Samples, especially the discussion of various 
lists of respondents and defects of such lists; 
Interview Procedures, especially the instruc- 
tions to interviewers with regard to such prob- 
lems as location of the proper respondent and 
reduction of refusal rates; Coding the Data, 
especially the summary of classifications and 
normative data obtained used in U. S. Census 
surveys; and the Bibliography, which, with
1,145 references, is the most complete in this 
field. 

Many readers, including the reviewer, will 
be disappointed to find so little reference to the
opinion survey work of Gallup, Roper, et al.,
either for illustrative purposes or for descriptions
of technique. Preference is given, presumably
as a result of the author’s own experiences,
to “census-type” surveys and survey
data. While this restriction does not decrease 
the value of the book as a manual for the seri- 
ous research worker, it does reduce its value as 
a textbook for students interested primarily in 
the analysis of public opinion data. 

Covering as it does such a wide variety of 
problems in the space of 536 pages of text, the 
book becomes, in spots, a rather loose listing 
of points without adequate discussion or evalu- 
ations. In some sections (as in the formula- 
tion of questions) this defect is probably un- 
avoidable. Even so, Parten has, by collating 
a wide variety of information and “know-how” 
within the covers of a single book, made a 
notable contribution to survey sampling methods.
Her “how-to-do-it” recommendations
may profitably be reviewed not only by the
novice in the field, but by the practicing sur- 
veyor as well. 


Kenneth E. Clark 


University of Minnesota 


Soden, William H. (Ed.) Rehabilitation of the 
handicapped: A survey of means and methods. 
New York: The Ronald Press Co., 1949. 
Pp. xiii + 399. $5.00. 


The stated purpose of this volume is “to assemble
representative accounts of procedures
in current use for the mental and physical rehabilitation
of persons disabled by illness or
injury or otherwise handicapped” (p. vii).

Reports of the current work of rehabilitation 
teams tend to be scattered throughout the 
many publications representing the particular 
specialties and interests of contributing team 
members. An accessible collection of current 
procedures typical of the broad field of rehabil- 
itation would fill a distinct need. 

The question to be answered, then, is: How 
representative of the entire field of rehabilita- 
tion is this volume? The answer depends on 
one’s conception of the rehabilitation process. 
Rehabilitation of the disabled is based on two 
main processes, physical restoration and voca- 
tional rehabilitation. Physical restoration 
makes use of many specialized medical skills 
plus a host of ancillary medical services such as
nursing, occupational and physical therapy, 
medical and psychiatric social work, as well as 
services available in curative workshops and 
rehabilitation centers. Vocational rehabilita- 
tion may involve clinical psychology, voca- 
tional counseling, education, social work and 
the technique of selective job placement as well 
as the use of community educational, place- 
ment and social service resources. 

This volume contains 38 articles, 21 of 
which are devoted to some aspect of medical 
rehabilitation, classified under the headings of 
general medical and surgical technics, neuro- 
logical methods, and psychiatric developments. 
The remaining seventeen articles are distributed
under two rubrics: vocational and social
rehabilitation, and educational and psychological
trends. Two articles in the medical
group discuss the role of nursing in psycho- 
surgery and psychiatric rehabilitation and two 
more are general discussions of the rehabilita- 
tion of the poliomyelitic and tuberculous. 
All others are technical papers discussing 
specific medical treatments primarily of inter- 
est to specialists. 

The papers grouped under vocational and 
social rehabilitation describe the operations of 
community workshops and rehabilitation cen- 
ters or the programs of official and voluntary 
national rehabilitation agencies. Included in 
this group, however, is an article on the role of 
social service in rehabilitation. The remain- 
ing section on educational and psychological 
trends contains an analysis of disability among 
600 disabled veterans, two articles on trends in 
occupational therapy and articles on the evaluation
of the training of physical education
majors for work in reconditioning, the psychological
aspects of rehabilitation and public
relations.


An evaluation of this volume in the light of
the question posed leads to the conclusion that
important segments of the rehabilitation field 
have not been adequately covered. The re- 
habilitation worker who expects to find con- 
tributions devoted to counseling, vocational 
guidance, vocational training and selective 
job-placement will be disappointed. Nevertheless,
the volume is characterized by a conscious
awareness on the part of the contributors
of the subjective factors present in the
restoration of any particular individual. Furthermore,
all of them are aware of the need for
coordinating the contribution of each specialty 
as part of the rehabilitation team. There is a 
wealth of material in the volume which will 
prove useful to those interested in physical 
restoration. 
M. E. Odoroff 
Federal Security Agency, 
Public Health Service

Konopka, Gisela. Therapeutic group work
with children. Minneapolis: Univ. Minnesota
Press, 1949. Pp. 134. $2.50.


This book gives a simple, straightforward ac- 
count of Mrs. Konopka’s contacts with two 
groups of children: (1) a group of delinquent 
boys committed to a reception center for ob- 
servation and treatment, and (2) a group of 
emotionally disturbed girls referred for out- 
patient study and treatment to a child guid- 
ance clinic. The two sets of case-records are 
preceded by a brief, practical introduction to 
“the group work method” with emphasis on 
the need for providing constructive, remedial 
group activities in working with delinquent 
youngsters who have been institutionalized. 
Mrs. Konopka sees the group worker as a 
member of a clinical team whose “main tools” 
are: (1) knowledge and use of his own person- 
ality, (2) understanding of group relations, 
and (3) imaginative and skillful use of “pro- 
gram.” A postscript at the end of the book 
emphasizes the role of the group worker in observing,
tying together, and making constructive
use of seemingly unrelated “small events”
which occur in the group situation. 

Therapeutic Group Work with Children has 
many commendable features. Among recent 
books and articles dealing with “group process” 
and “group therapy,” Mrs. Konopka’s book is 
singularly devoid of professional jargon and of 
dogmatic, sweeping generalization. She is en- 
thusiastic without being defensive. Her re- 
ports of the behavior of boys and girls in group 
meetings are vivid and alive. Mrs. Konopka’s 
sensitivity to the attitudes and feelings of her 
clients and her resourcefulness in applying a 
variety of group skills offer helpful study ma- 
terials. One is struck by the “normal” be- 
havior of emotionally upset youngsters in the 
therapeutic group situations. 

The book has its limitations; e.g., it does not
provide a definitive, comprehensive discussion 
of group methods in therapy. There is only 
scant and spotty reference to the literature on 
social and recreational group work which forms 
the background of Mrs. Konopka’s own activi- 
ties; there is no attempt to define these areas in 
relation to fields such as social anthropology, 
sociology, social and clinical psychology, and 
psychiatry. The psychologist may criticize 
both Dr. Lippman’s and Mrs. Konopka’s loose 
use of the term “experiment” to describe her 
study, Mrs. Konopka’s failure to point out the 
fallibility of report by a single participant- 
observer and to distinguish between observa- 
tion and inference in her summaries of the 
group meetings, and the inadequacies of her 
case histories and progress reports which “are 
included at the end to help the reader analyze 
the effectiveness of group work in diagnosis 
and treatment” (p. 93). 

Therapeutic Group Work with Children will 
have value for the practitioner who has immediate
need for a practical introduction to the use
of group methods. Such methods, however,
should be supported by evidence from cooperative
studies of individuals in groups by the
skilled practitioner and experimentalists with
varied backgrounds.

It should be reemphasized that Mrs. Kono- 
pka has been quite modest in her claims about 
her work. Apart from limitations of theoreti- 
cal framework and research method, I found 
her little book delightful and informative. 


Harold B. Pepinsky 
The State College of Washington 


Board of Examiners, Michigan State College. 
Comprehensive examinations in a program of 
general education. East Lansing, Michigan: 
Michigan State College Press, 1949. Pp. 
165. $4.00. 

This is more than a manual of procedures 
for the construction of objective tests. It presents
an account of a program of general education,
with emphasis upon the place of evaluation
in such a program. It indicates in
detail the form taken by the comprehensive 
examinations, with due attention to educa- 
tional philosophy, administrative organization, 
and personnel problems. 

In the general education program at Michi- 
gan State College, the comprehensive examina- 
tions are arranged by specialists, in cooperation 
with other members of the teaching staff. 
Marks received by students are therefore not 
primarily controlled by the instructors in the 
courses. Since this places a heavy responsibil- 
ity upon the Board of Examiners, it has re- 
sulted in taking great care to provide adequate 
examinations in such various fields as the social 
studies, biological science, physical science, 
history, literature and the arts, and English. 
The specialists in the different subjects have 
constructed their tests using modern objective 
methods. The reactions of faculty members 
and of students to the testing program have 
been carefully reported in one of the chapters 
of the book. 

Educational objectives as stated in any edu- 
cational program need to be clarified in the 
light of evaluation. Comprehensive testing, 
rather than testing upon small units of as- 
signed work, is desirable. The truth of both 
these principles has long been recognized, but 
in common practice both principles are ig- 
nored. The testing program at Michigan 
State College furnishes an exception, and this 
book which reports upon it can have great 
value. The attempt to indicate in some detail 
the form taken by new-type tests in one gen- 
eral education program provides a basis for 
critical reaction by scholars. One may hope 
that it will lead to further constructive activity. 
The book should be read by all college pro- 
fessors. 

Harold D. Carter

University of California,
Berkeley, California